PubChem Help

This document provides tips and examples for searching the three PubChem databases by text term/keyword, as well as tips for searching PubChem Compound by chemical properties. The Structure Search Help document provides tips on using chemical information for basic and advanced searches in the PubChem Structure Search tool. In addition, the PubChem Upload Help document provides procedures and instructions for depositing your structure/assay data into the PubChem system using the PubChem Upload tool. The PubChem Download Facility Help document describes how to use the PubChem Download Facility.

 

PubChem Overview

PubChem provides information on the biological activities of small molecules.

PubChem includes substance information, compound structures, and BioActivity data in three primary databases, Pcsubstance, Pccompound, and PCBioAssay, respectively.

  • Pcsubstance contains more than 180 million records. You can check the count of substance records as of today.

  • Pccompound contains more than 63 million unique structures. You can check the count of compound records as of today.

  • PCBioAssay contains more than 1 million BioAssays. Each BioAssay contains a varying number of data points. You can check the count of BioAssay records as of today.

The Substance/Compound database, where possible, provides links to BioAssay descriptions, literature, references, and assay data points. The BioAssay database also includes links back to the Substance/Compound database. PubChem is integrated with Entrez, NCBI's primary search engine, and also provides compound neighboring, sub/superstructure, similarity structure, BioActivity data, and other searching features.

PubChem contains substance and BioAssay information from a multitude of depositors. You can check the PubChem data source status as of today. 

PubChem Substance Database

The PubChem Substance database contains chemical structures, synonyms, registration IDs, descriptions, related URLs, database cross-reference links to PubMed, protein 3D structures, and biological screening results. If the contents of a chemical sample are known, the description includes links to PubChem Compound.

 Query Examples:

Query Results:

 

PubChem Compound Database

The PubChem Compound Database contains validated chemical depiction information that is provided to describe substances in PubChem Substance.

Structures stored within PubChem Compound are pre-clustered and cross-referenced by identity and similarity groups. Additionally, calculated properties and descriptors are available for searching and filtering of chemical structures.

Users can perform a term/keyword search in the same manner as for the Substance database (see above). In addition, the PubChem Compound database provides a chemical property search.

Examples:

The PubChem Compound Limits page provides a quick way to perform complex searches. All search examples shown above can be done on the Limits page. Go to the Limits page to begin any of the examples below.

Examples:

Query Results:

 

PubChem BioAssay Database

The PubChem BioAssay Database contains BioActivity screens of chemical substances described in PubChem Substance. It provides searchable descriptions of each BioAssay, including descriptions of the conditions and readouts specific to a screening protocol.

Query Help:

Query Results:

PubChem Summary and Analysis


PubChem results are displayed on three category pages: substance, compound, and BioAssay pages. They provide rich cross links to each PubChem database, other NCBI databases, and depositors' databases. PubChem's default results page is part of the Entrez summary list display system.

Substance Summary:
From the Entrez PubChem Substance database, users can get a substance summary with thumbnails, the corresponding compound ID, depositor source information, etc. You can see an example of a substance result in Entrez.



On this page, users can choose to display brief, summary, ID map, substance neighboring information, synonyms, and other database information from the dropdown list. On the right of the page, users can select from a few pop-up windows (when available) to get structure, BioAssay, and literature links related to this substance. Users can choose either to "display" the search results or to "send" them to "text" or to a "file".

Users can find more detailed substance information and cross links by clicking the structure image or the ID link. Here is an example of the PubChem Substance Summary page:



This page displays the original depositor-provided information, such as substance information, the deposited structure drawing, older version selection, comments, etc. Users can also find some derived information and links, if available.

Click the associated "Chemical Structure" tab to display the standardized compound information, including property data, other depositor-provided synonyms, descriptors, cross links, the PubChem standardized structure drawing, etc.
Power users can also download the data in different formats, such as ASN.1, XML, and SDF.

Compound Summary:
All compounds are extracted from deposited substances. Natural-product substances and other substances that do not have structures have no associated compound records. A substance that is a mixture is associated with a compound record for the mixture and with one or more compound records for its components.

From the Entrez PubChem Compound database, users see a compound summary with thumbnails, a few compound properties, etc. Here is an example of a compound result in Entrez.

The page is in the same style as the substance page. Clicking a thumbnail or the CID hyperlink leads to the Compound Summary page, where users can find the compound's property data, description, related substance information, neighboring structures, and cross links.

All compounds are structurally unique when compared with each other. One compound may link to many substances.

 

Substance/Compound Summary Content:

The title shows the chemical name and PubChem accession identifier. The toolbar contains icons that allow users to launch: a bioactivity summary, when bioactivity is available; a chemical structure search, to search by identity, similarity, super/sub-structure, or molecular formula; a 3D conformer launch tool, when a conformer is available; or a data download in various formats, including the native PubChem archive format ASN.1, XML, or the industry-standard SDF format.


BioMedical Annotation:
Content in this section is provided by the NLM MeSH resource. MeSH is the U.S. National Library of Medicine's controlled vocabulary used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.

On the substance page (deposited record), the BioMedical annotation is derived from the MeSH resource by matching deposited synonyms. On the compound page (chemical structure), the information is derived from the combined synonyms with a name-weighting algorithm.
 
This section also contains medication information (from NLM DailyMed), pharmacological action, drug and chemical classification, safety and toxicology, and PubMed linking information, when available.

Safety and Toxicology:
Content in this section is provided by NLM TOXNET.

BioAssay Results:
Content in this section is provided by the PubChem BioAssay database. A summary of available results is provided. A launch point for bioactivity summary analysis is provided for the current compound or the current compound including similar compounds. To view all contributed BioAssays, click the "more..." link.

Synonyms:
Content in this section includes synonyms provided by depositors. "Unfiltered" synonyms are all the synonyms provided by depositors. "Filtered" synonyms are synonyms that are consistent both within and across depositors. The substances assigned to a synonym need to be consistent at one of the following levels (high to low): exact same structure, same stereo form, same connectivity, same parent structure, same parent stereo form, or same parent connectivity. Filtered synonyms are sorted by consistency level, frequency, and readability score, while unfiltered synonyms are sorted by frequency and readability score only. The frequency is the number of times a synonym is provided by depositors for a particular compound structure, so the most commonly used synonyms are shown first. For substances, the frequency of a synonym is always 1. The readability score is determined by the length of the synonym, the count of non-alphabetic characters, capitalization, etc. A MeSH tree icon indicates synonyms that are known to MeSH. Sorting and display controls are available. By default, only the first ten synonyms are shown.

Properties:
Content in this section includes computed properties of the compound record. The properties are listed below and include various counts.

- 2D compound properties:

Molecular Weight
Molecular Formula
XLogP
H-Bond Donor
H-Bond Acceptor
Rotatable Bond Count
Tautomer Count
Exact Mass
MonoIsotopic Mass
Topological Polar Surface Area
Heavy Atom Count
Formal Charge
Complexity
Isotope Atom Count
Defined Atom StereoCenter Count
Undefined Atom StereoCenter Count
Defined Bond StereoCenter Count
Undefined Bond StereoCenter Count
Covalently-Bonded Unit Count

- 3D conformer properties:

Feature 3D Acceptor Count
Feature 3D Hydrophobe Count
Feature 3D Ring Count
Effective Rotor Count
Conformer Sampling RMSD
CID Conformer Count - Conformer count for a given compound.

Descriptors:
Content in this section includes computed descriptors of the compound record. A list of descriptors is below.

IUPAC Name
Canonical SMILES
InChI
InChIKey

Compound Info:
Content in this section is provided by the PubChem Compound database. The PubChem Compound accession identifier (CID) is provided with the date the CID was created and, if a mixture, the parent compound CID (when applicable) and a link to the unique components comprising the compound record. Links to related compounds (when applicable), with varying degrees of identity (e.g., being different by isotopic or stereochemical means), and 2D chemical similarity are provided.

Substance Info:
Content in this section is provided by the PubChem Substance database. When viewing a substance record, this section contains the PubChem Substance accession identifier (SID) along with the dates the SID was first created and last updated by the depositor. When the substance can be linked to a unique compound record, the PubChem Compound accession identifier (CID) is provided along with the date the CID was created, the parent compound CID (when available), and a link to the unique components comprising the compound record.

When viewing a compound record, this section contains links to all related PubChem substance records, which either are the same compound or contain the compound as part of a mixture. Substance Categorizations are also provided to help you identify useful resources provided by PubChem depositors.

Structure & Quick Link Bar:
Content in this section includes the 2D structure depiction, the 3D conformer image (if available, toggled with the 2D depiction), the Pc3D application download link, frequently used compound property data, and links. Note: this section can be collapsed by clicking the vertical bar at the top and expanded again by clicking the same bar. Double-clicking the long, thin, light-grey vertical area on the left side of the quick bar (the mouse cursor changes to "+" when hovering over it) performs the same function.

  


Structure Clustering for Compounds/Substances:

The compounds/substances are clustered by structure similarity using the Single Linkage algorithm. The structure similarity is either the Tanimoto score calculated from the 2D structure fingerprint or the 3D shape/feature similarity. The 3D coordinates are theoretically calculated. For 2D structure analogs, a Tanimoto score of 0.68 or greater is statistically significant at the 95% confidence interval. For 3D structures, the similarity scores that are statistically significant at the 95% confidence interval are:

  • 3D Shape + Feature: 0.88 using 1 conformer, 1.03 using 10 conformers

  • 3D Shape ST-Optimized: 0.74 using 1 conformer, 0.85 using 10 conformers

  • 3D Feature CT-Optimized: 0.30 using 1 conformer, 0.39 using 10 conformers

Both the simple view with the compound/substance IDs and the view with the structures are provided. The limit is 4000 compounds for 2D structure analogs and 1000 compounds for 3D structures. If more compounds are input, a warning message is shown.
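
The 2D case described above can be illustrated with a minimal sketch. It assumes each fingerprint is given as a set of "on" bit positions and applies the 0.68 Tanimoto threshold quoted above; the function names and the toy fingerprints are hypothetical and are not PubChem's implementation. Cutting a single-linkage tree at a similarity threshold gives the same groups as joining every pair that scores at or above that threshold, which is what the sketch does.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as dissimilar (assumption)
    return len(a & b) / union


def single_linkage_clusters(fingerprints, threshold=0.68):
    """Group IDs so that any pair scoring >= threshold shares a cluster.

    This equals cutting a single-linkage dendrogram at `threshold`.
    `fingerprints` maps an ID to its set of 'on' bits.
    """
    ids = list(fingerprints)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Toy fingerprints (hypothetical bit positions, not real PubChem data).
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4, 5}, "C": {10, 11, 12}}
print(single_linkage_clusters(toy))
```

In this toy input, A and B share 4 of 5 distinct bits (score 0.8) and fall into one cluster, while C shares no bits with either and stays alone.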

3D Conformer Similarity

Each compound may have up to 10 calculated conformers. If the compound has no 3D conformers but its parent compound does, the parent compound is used in calculating the 3D conformer similarity. When the compounds are clustered, you can Choose Conformer Pairs by "Most Similar" or "All". "Most Similar" means that the clustering is on compounds: when the set of conformers for compound A is compared with the set of conformers for compound B, the most similar conformer pair is used to represent the pair of compounds. "All" means that the clustering is on conformers.
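
The "Most Similar" option described above amounts to taking the best-scoring conformer pair between two compounds. A minimal sketch follows; the `similarity` argument is a hypothetical stand-in for whichever 3D shape/feature score is in use, and the conformer labels and scores are made up for illustration.

```python
def most_similar_pair(confs_a, confs_b, similarity):
    """Return (score, conf_a, conf_b) for the highest-scoring conformer pair.

    `similarity` is a caller-supplied function (an assumption of this sketch)
    that scores one conformer of compound A against one of compound B.
    """
    return max(
        ((similarity(ca, cb), ca, cb) for ca in confs_a for cb in confs_b),
        key=lambda item: item[0],
    )


# Made-up pairwise scores for two conformers of A and two of B.
scores = {
    ("a1", "b1"): 0.5, ("a1", "b2"): 0.9,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.6,
}
best = most_similar_pair(["a1", "a2"], ["b1", "b2"], lambda x, y: scores[(x, y)])
print(best)
```

With "All", by contrast, every conformer would be treated as its own item in the clustering instead of being reduced to one representative pair per compound pair.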

During the 3D similarity calculation, 3D Superposition is Optimized by either "Shape" or "Feature". If the calculation is shape-optimized, the 3D similarity can be represented by the sum of "Shape" and "Feature" similarity scores, or just "Shape" similarity score. If the calculation is feature-optimized, the 3D similarity can be represented by the sum of "Feature" and "Shape" similarity scores, or just "Feature" similarity score.

The Number of Conformers per Compound (nconf) is chosen so that the calculation finishes in 1-2 minutes. This number, "nconf", is also shown in the clustering image. You can increase this number up to 10 to use more conformers in the calculations.



Collapse Compound Cluster: The Compound Cluster Tree can be collapsed if you click on the ruler as shown below. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded.


Select from Cluster: As shown in the following image, if you click on a blue circle in the Cluster Tree, a menu will pop up. The options for 2D clustering are "Compounds in Entrez", "Compounds in BioActivity Analysis", "Structure Similarity Scores", "Expand Subtree", and two Revise Selections: "Display Subtree Only" and "Remove Subtree & Display the Rest". There are two more options for 3D clustering: "Compounds in 3D Viewer" and "Conformers Used".

Common Substructures: As shown in the following image, if you hover over a node (blue circle) or the line on its left, the common substructures for the compounds in the subcluster pop up. Currently only the 2D common structures are shown. If the similarity of the node is >= 0.9, the common fingerprint bits greater than 574 are shown. Otherwise, the fingerprint bits greater than 713 are shown.

Export:

Similarity Data: You can export Structure Similarity Scores used to generate the dendrogram.

Conformers Used: This button appears only when you choose "Most Similar" conformer pairs to calculate the 3D Shape/Feature similarity for each pair of compounds. You can export the selected conformer pair for each compound pair.

Image: You can export the display in one full PNG image since the display may consist of many small images.

Clusters in GML: You can export the clusters as a Graph Modelling Language (GML) file, which can be viewed in other software such as Cytoscape. The GML file format can be easily converted to other formats such as the eXtensible Graph Markup and Modeling Language (XGMML), Graph eXchange Language (GXL), and GraphML.

Result Display Option - Group Results by: You can switch between "Compound" and "Substance" views. The compounds are grouped from the corresponding substances.
Save View: is defined below.



PubChem BioActivity Services

This page is the common gateway to the PubChem BioActivity Analysis Service. It provides a central entry point for accessing bioassay records and tools, including BioAssay Summary, BioActivity Summary, Data Table, and Structure-Activity Analysis, for a selected substance/compound/assay set. Data Table additionally offers data analysis through Plots and selection of detailed test results. The functionality and navigation of these services are documented below.
Files saved for recording analysis status can be imported using the "Open Saved View" tab. The chemical structure clustering tool launch point is also in this page. [Ref: Nucleic Acids Res, 2009; (6).]



BioAssay Summary: The BioAssay Summary service allows one to review the information content of PubChem BioAssay records, including information provided by assay depositors as well as annotations provided at PubChem. To retrieve a specific bioassay record, please provide the PubChem BioAssay accession, AID.

BioActivity Summary: The BioActivity Summary service reports the available biological screening results for a single chemical sample or a set of chemical samples. This service provides a means to examine and compare biological outcomes across multiple biological tests. Please specify compounds using the given input methods. To retrieve specific test results, please specify bioassays.

BioActivity DataTable: The Data Table tool supports rapid search and retrieval of test results for one or more bioassay records. Please specify bioassays using the given input methods.

Structure-Activity Analysis: The Structure-Activity Analysis service clusters compounds and bioassays simultaneously using chemical structure, biological outcome, and target information. This service provides exploratory tools that allow one to identify structure-activity relationships and examine the target selectivity and specificity of a compound. Please specify compounds and bioassays using the given input methods.

Structure Clustering: The Chemical Structure Clustering Tool clusters compounds/substances based on structure (fingerprint) similarity using the Single Linkage algorithm. Please specify compounds/substances using the given input methods.

Open Saved View: The launch point for the saved view file. A "view file" can be saved from BioActivity Summary, BioActivity Datatable, Structure-Activity Analysis, and Chemical Structure Clustering pages. For more information about a view file, click here.


Display: Allows users to switch between compound and substance input.

Compound Input: Allows users to specify the compound input. Users can choose only one input method: search term, CID list, CID list file, or an Entrez history key (if available).

Substance Input: When substance input is selected, users can specify the substance input using one of the following: search term, SID list, SID list file, or an Entrez history key (if available).

BioAssay Input: Allows users to specify the bioassay input. Users can choose only one input method: search term, AID list, AID list file, or an Entrez history key (if available).

UID List: A UID (here referring to a CID, SID, or AID) list should be a comma-separated list of numbers. Spaces, semicolons (;), new lines, and tabs can also be used as delimiters. For SID input, users can choose to use the ID-Map file, which can be obtained from the pcsubstance docsum page.
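
As a rough illustration of the UID list format described above, the following sketch splits a list on any of the accepted delimiters and checks that every entry is numeric. The function name is hypothetical, and rejecting non-numeric entries with an error is an assumption of this sketch, not documented PubChem behavior.

```python
import re


def parse_uid_list(text):
    """Parse a CID/SID/AID list separated by commas, semicolons, spaces, tabs, or new lines."""
    tokens = [t for t in re.split(r"[,;\s]+", text.strip()) if t]
    bad = [t for t in tokens if not re.fullmatch(r"[0-9]+", t)]
    if bad:
        # Assumption: treat anything non-numeric as an input error.
        raise ValueError(f"non-numeric UID(s): {bad}")
    return [int(t) for t in tokens]


# Example IDs chosen only for illustration, mixing all accepted delimiters.
print(parse_uid_list("2244, 3672;5090\n1983\t2519"))
```

Prepared this way, a list copied from a spreadsheet or text file can be checked before it is pasted into the CID, SID, or AID list box.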


BioAssay Summary:

The BioAssay Summary may be accessed through the NCBI Entrez system, where one can search the PubChem BioAssay database using a specified keyword. Users can see an example of an Entrez BioAssay search result for the term "peroxiredoxins".



Using the "Display" pull-down menu on this page, users may choose to view lists of summaries, brief summaries, unique identifiers, compounds, substances, free text article links (via PMC), and PubMed citations. On the right of the page, users can select from a few pop-up windows (when available) to get Related BioAssays, Compounds, Literature, etc.

Clicking the AID hyperlink leads to the BioAssay Summary page.

This page shows detailed descriptions of a BioAssay, including citation links, experimental protocols, and depositor comments. "Data Table(Active)" links to test results for compounds considered active in the particular BioAssay, while "Data Table(All)" links to the complete test results. This page also provides links to a few data analysis resources/tools that are derived at PubChem, such as "BioActivity Summary", "Related BioAssay", etc. The bottom of the page shows detailed readout information, such as names, descriptions, and data types. "Test Concentration" and "Active Concentration" attributes are flagged with * and **, respectively. The glossary for this page is listed below.

AID: PubChem's BioAssay identifier.

BioAssay Version: The BioAssay version number is composed of major version number and minor version number. We encourage you to look at the current version result as it is the updated data from the depositors.

Name: The BioAssay name provided by the depositor.

Data Source: Depositor's source name (unique in PubChem)

Deposit Date: Date when data was first deposited.

Modify Date: Date when data was revised.

BioAssay Results: Data table for active substances or all substances.

BioActive Compounds: Active compounds/substances tested in the BioAssay. Related links for the compound/substance set.

Related BioAssays: Related BioAssays by activity overlap, target similarity, and/or related to the same tested compound/substance set.

Protein Target: Protein target related to this BioAssay.

Links: Extra linked information to this BioAssay.

Compounds: Compounds tested for this BioAssay, including activity information.

Substances: Substances tested for this BioAssay, including activity information.

PubMed: PubMed citations related to this BioAssay.

Nucleotide: NCBI Entrez Nucleotide links to this BioAssay if available.

Taxonomy: NCBI Entrez Taxonomy links to this BioAssay if available.

Structure: MMDB links to this BioAssay.

Gene: Gene links to this BioAssay.

BioAssay: The BioAssays related to this one.

Description: The BioAssay's description provided by the depositor.

Protocol: The BioAssay's protocol provided by the depositor.

Comment: The BioAssay's Comment provided by the depositor.

Categorized Comment: The BioAssay's Categorized Comment provided by the depositor.

Result Definition: The BioAssay's result definition provided by the depositors.

Test Concentration: The concentration at which compounds are tested in a BioAssay.

Activity Concentration: The concentration which produces 50% of the maximum activity. Same as IC50, EC50, etc.


BioActivity Analysis

BioActivity Analysis shows the activity analysis for a set of compounds/substances and BioAssays. It has three views: Summary, Data Table, and Structure-Activity as described below.


BioActivity Analysis - Summary:

This is one of the three views of "BioActivity Analysis". It displays a summary of tested compound/substance activity across multiple BioAssays. There are three subviews under "BioActivity Analysis - Summary": BioAssays, Targets, and Compounds. These view pages provide the bioactivity information for each bioassay, target, and compound, respectively.

Each of the three view pages has three sections: "Revise BioAssay and Compound Selection", "Table", and "Extra Options". "Revise BioAssay and Compound Selection" shows the assay and compound/substance counts in several subgroups and provides a way to revise the assay and compound/substance lists used to calculate the counts in the Table. The subgroups serve as filters: clicking one of them revises the corresponding list. "Extra Options" provides ways to switch between compound and substance views and to download the data table (in CSV format).


Launch the BioActivity Analysis page:

Users can launch this page from the PubChem Substance, Compound, and BioAssay summary reports in Entrez, where users may click the display pull-down, choose "PubChem BioActivity Summary", and see the compound/substance activity distribution across all BioAssays. If launching from Entrez PubChem BioAssay, users see all active compounds across each BioAssay. Other launch points for this service are available from the "BioAssay Summary", "Data Table", and "Structure-Activity Analysis" services, or from https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi. Most of these launch points normally lead users to the "BioActivity Analysis - BioAssays" page. Users can then go to the other pages by clicking the corresponding tab on the page. Users can also launch the "Targets" or "Compounds" page directly for specific AID(s) or CID(s) with the URL https://pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=tgt&cid=xxxx (or aid=xxxx) or https://pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi?q=cmp&cid=xxxx (or aid=xxxx), respectively, where xxxx is the AID or CID.
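
The direct-launch URL pattern described above can be assembled programmatically. The sketch below only builds the URL string for a single identifier; the function and dictionary names are hypothetical, the example CID is illustrative, and nothing here checks that the service still responds at this address.

```python
from urllib.parse import urlencode

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/assay/assaytool.cgi"

# Values of the "q" parameter taken from the URL pattern in the text above.
VIEWS = {"targets": "tgt", "compounds": "cmp"}


def summary_view_url(view, id_type, uid):
    """Build a URL for the "Targets" or "Compounds" page for one CID or AID."""
    if view not in VIEWS:
        raise ValueError(f"view must be one of {sorted(VIEWS)}")
    if id_type not in ("cid", "aid"):
        raise ValueError("id_type must be 'cid' or 'aid'")
    query = urlencode({"q": VIEWS[view], id_type: int(uid)})
    return f"{BASE_URL}?{query}"


# CID 2244 is used purely as an example identifier.
print(summary_view_url("targets", "cid", 2244))
```

How several AIDs or CIDs should be combined in one URL is not specified in the text, so the sketch deliberately handles one identifier at a time.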


BioActivity Analysis - Summary - BioAssays:

You can go to https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, click the "Assay-centric" tab, and search by the protein target name (e.g., "Protein kinase C alpha type; PKC-A; PKC-alpha"). The "BioAssays" table contains the AID, active compound/substance count, inactive compound/substance count, total tested count, the counts of compounds/substances with an active concentration less than or equal to 1 µM or 1 nM, the range of active concentrations, the BioAssay name, and the protein target name. Clicking a count leads to the respective Data Table.


BioActivity Analysis - Summary - Targets:

You can go to https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, click the "Target-centric" tab, and search by the protein target name (e.g., "Protein kinase C alpha type; PKC-A; PKC-alpha"). The "Targets" table displays the tested results for each protein GI. For each protein target, the table contains the target name, bioassay count, chemical probe count, active compound/substance count, the counts of compounds/substances with an active concentration less than or equal to 1 µM or 1 nM, and the total tested compound/substance count. Some table columns are hidden by default. Users can click "More Columns" in the upper-right corner to show all columns.

All table columns are sortable. When the mouse pointer is placed over the head of a sortable column, the cell background color changes from light grey to orange. Click the column head title to sort the table by that column. The arrow just after the head title indicates the sorting direction.

"Tips" To look at the bioactivity results in the PubChem BioAssay database for certain targets, select the targets of interest on the "Targets" view page, click another tab ("Assays" or "Compounds"), and then click the "Targets" tab again. When check boxes on the left of some table rows are checked, only the selected subset of targets is carried to the next page when the user clicks any of the "DataTable", "Structure-Activity", "BioAssays", and "Compounds" buttons.

"Count Links" All counts for targets and bioassays link to an NCBI Entrez page that displays the list. The counts for compounds/substances link to various pages with detailed information for these counts. When one clicks these links, several options pop up: one goes to an NCBI Entrez page to display the list, one goes to the PubChem Data Table for the tested results for these counts, and another goes to PubChem Structure-Activity (SAR) analysis.


BioActivity Analysis - Summary - Compounds(Substances):

If you search for "aspirin" at https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, the "Compounds(Substances)" table displays the tested results for each chemical compound (substance). By default, the table contains, for each chemical compound, the counts of bioassays in which the compound has been designated a "chemical probe", found active, found active with an active concentration less than or equal to 1 µM or 1 nM, and tested. The table also contains the count of protein targets against which the compound has tested active. Unique protein target counts are used here, which means that all protein targets with an identical sequence are grouped together and counted once. The compound's active concentration range (values of IC50, etc.) is also provided in the table.

Users can select "Substance" in the "Display Results By" dropdown menu in the "Result Display Option" section just below the table and then click the "Apply" button to switch from the compound view to the substance view.

"Tips" To look at the bioactivity results in the PubChem BioAssay database for certain compounds, select the compounds of interest on the "Compounds" view page, click another tab ("Assays" or "Targets"), and then click the "Compounds" tab again. When check boxes on the left of some table rows are checked, only the selected subset of compounds is carried to the next page when the user clicks any of the "DataTable", "Structure-Activity", "BioAssays", and "Targets" buttons.

"Count Links" All counts for bioassays, compounds/substances, and targets link to various pages with detailed information for these counts. When one clicks these links, several options pop up: one goes to an NCBI Entrez page to display the list, one goes to the PubChem Data Table for the tested results for these counts, and another may go to PubChem Structure-Activity (SAR) analysis.


The Data Table tab shows the result data table for the selected BioAssays, or for all BioAssays (up to a maximum of 50) when none are selected, with the substance set in the page.

The Structure-Activity tab shows the Structure-Activity Analysis for the selected or all BioAssays with the substance/compound set in the page.

Revise Substance/Compound Selection allows you to reset the substance/compound set based on a few options.

Revise BioAssay Selection allows you to reset the BioAssays from the following options.

Result Display Option - Group Results by: Users can switch between Compound and Substance.

BioActivity Analysis - Data Table:

This is one of the three views of "BioActivity Analysis". The other views, "Summary" and "Structure-Activity", are available as tab options. The Data Table page displays the search results. There are four menus for Data Table: "Data Table, Concise", "Data Table, Complete", "Plot", and "Select".


Result Display Option: is defined above.
Save View: is defined below.
Result Exports allows you to download the result set, including chemical structures and readouts, in the chosen format. If the FTP connection hangs, Passive FTP mode is probably not enabled in Windows. You can enable Passive FTP with these steps:
Step 1: Click the Start button to open the Start menu.
Step 2: Type "Internet Options" (without quotes) in the search box.
Step 3: Click "Internet Options" in the list of results to open the Internet Properties dialog window.
Step 4: Click the "Advanced" tab to select it.
Step 5: Scroll down to the Browsing heading and check the box next to "Use Passive FTP (for Firewall and DSL Modem Compatibility)." Click "OK" to confirm the setting.

Data Table, Concise -- shows concise results, which contain the activity outcome, score, and active concentration, if provided.

Data Table, Complete -- shows test results corresponding to complete or selected readout fields.

Dose-Response Curve -- If the data table has dose-response data, the icon shows the link to the Dose-Response Curve.

Curve fit with the Hill Equation: The experimental data are fit with the Hill equation v = (V * S^n / (K + S^n)) + b or v = - (V * S^n / (K + S^n)) + b, using the nonlinear regression algorithm described by Pinto et al. (1984), without weighting. Here v is the response, S is the concentration, n is the Hill coefficient, K is the apparent dissociation constant, V is the maximum response, and b is a parameter related to the baseline response. The experimental data are shown as colored symbols, and the fit curve is shown in black.
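
To make the two forms of the Hill equation above concrete, here is a small sketch that only evaluates the model for given parameters; it does not reproduce the regression procedure of Pinto et al. The function name and the `decreasing` flag, which selects the form with the leading minus sign, are assumptions of this sketch.

```python
def hill_response(S, V, K, n, b, decreasing=False):
    """Evaluate v = +/-(V * S^n / (K + S^n)) + b.

    S: concentration, V: maximum response, K: apparent dissociation constant,
    n: Hill coefficient, b: baseline-related parameter.
    decreasing=True selects the form with the leading minus sign.
    """
    term = V * S**n / (K + S**n)
    return (-term if decreasing else term) + b


# Illustrative parameter values only.
print(hill_response(0.0, V=100.0, K=4.0, n=2.0, b=5.0))   # response at zero concentration
print(hill_response(2.0, V=100.0, K=4.0, n=2.0, b=5.0))   # S^n equals K here
print(hill_response(2.0, V=100.0, K=4.0, n=2.0, b=5.0, decreasing=True))
```

In this parameterization, the response is halfway between b and b + V when S^n = K, so the half-maximal concentration is K^(1/n) and not K itself.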

You can choose linear or log scales for both the X- and Y-axes. The default is log scale for the X-axis and linear scale for the Y-axis. The "Download Data" button shows the dose-response data. The "Data Table" button links to the data table for the pair of assay and chemical. If more than one curve is available, you can click the button "One Curve per Graph" to show each curve in a separate graph.

BioAssay Plot -- This page provides an interface for plotting "Scatter Plot" and "Histogram". Users can select up to 5 rows. The "Scatter Plot" will show figures for all pairs. The "Histogram" will show figures for all rows. Users can also click on each to get the histogram for that row.

Scatter Plot and Histogram: By clicking two diagonal points in the figure, you can view the enclosed data with four options: "Plot selected data", "Show selected data", "Show selected data, active only", and "Show selected data, inactive only".

BioAssay Select -- This page provides an interface that lets you carry out a BioAssay result search.

Navigate Buttons:

Press the "Show" button to retrieve the BioAssay(s) data table results based on your query criteria.
Press the "Clear" button to clear/reset the query form.

Summary Results provides a search interface for you. You can search the activity outcome, rank score, and/or test date from the displayed search form. Click the expand icon to expand the BioAssay result search form. (A collapse icon will then be shown; clicking it collapses the form.)

Outcome Filter allows you to select tested compounds/substances based on the activity outcome. The checkbox allows the outcome to be displayed in the result page. By default, it is checked.

Activity Score Filter allows you to select tested compounds/substances based on the activity rank score. The checkbox allows the rank score to be displayed in the result page. By default, it is checked.

Updated Date Filter allows you to specify the date range for the assay. By default, all results are returned if no date is entered. The input format is yyyy/mm/dd; mm and dd are optional.

Other Experimental Results provides a detailed search interface for you. Click the expand icon to expand the BioAssay result search form. (A collapse icon will then be shown; clicking it collapses the form.)

All result fields are checked by default. You can unselect/select all by clicking the checkbox in the header row. Selected results will be displayed in the result page.

Results of integer/float type can be searched with a lower-bound value and/or an upper-bound value. String-type results can be searched either by selecting a string term from the dropdown list or by a pattern string. Boolean-type results can be searched by selecting a radio button.

Pattern search: You can use a pattern to perform a string search. A pattern is a part of a search term.

Result Filter: A few result filters are available to narrow your result search.

Substance Filter. You can supply a SID list to your search using a list file, list text, or Entrez history.

Compound Filter. You can supply a CID list to your search using a list file, list text, or Entrez history.

Select Other BioAssays provides a function that allows you to add/change BioAssays. WE DON'T ENCOURAGE YOU TO PROCESS MULTIPLE BIOASSAYS UNLESS YOU KNOW THAT TWO OR MORE BIOASSAYS ARE RELATED AND YOU WANT TO COMPARE THEIR RESULTS. You can choose up to 5 BioAssays to process their data together.


BioActivity Analysis - Structure-Activity:

This is one of the three views of "BioActivity Analysis". It shows the Structure-Activity relationship in a heatmap display. The sample page is shown below. The default limit of compounds and BioAssays is set to 1000 in order to get the job done in around one minute. If more than 1000 compounds or BioAssays are input, a warning message will show up and you can change the limit to a number <= 4000. However, users need to wait for more than one minute to get the job done.


Compound/BioAssay Clusters: This is probably the most important feature in this tool. Users can cluster compounds and BioAssays differently to do the Structure-Activity analysis. Compounds could be clustered by "2D Structure", "3D Structure", or "Activity" similarities. BioAssays could be clustered by "Activity", "Protein Target" sequence, "Depositor-Specified", or "BioSystems" similarities.

Activity Data: There are four kinds of activity data: Activity Outcome, Activity (IC50 etc.), Linear Score, and Percentile Score. These activity data are used in the clustering by Activity Similarity. They are also shown in the Heatmap. Each cell in the Heatmap corresponds to the test result of one compound in one BioAssay.

3D Conformer Similarity

During the 3D similarity calculation, 3D Superposition is Optimized by either "Shape" or "Feature". If the calculation is shape-optimized, the 3D similarity can be represented by the normalized "Shape + Feature" similarity score, or just "Shape" similarity score. If the calculation is feature-optimized, the 3D similarity can be represented by the normalized "Feature + Shape" similarity score, or just "Feature" similarity score.

A certain Number of Conformers per Compound (nconf) is chosen to finish the calculation in 1-2 minutes. This number "nconf" is also shown in the clustering image. You can increase this number up to 10 to use more conformers in the calculations.

Revise Selection: Users can revise both compound/substance and BioAssay. The detailed options are hidden by default. Users can click the "+" sign near Revise Selection to show the details.

Revise Compound Selection: "Select Active" selects the subset of compounds/substances active in some of the selected BioAssays. "Add Active" adds additional compounds/substances active in some of the selected BioAssays. Similarly, you can also "Select" or "Add" by Active Concentration (IC50, etc). You can also remove a list of comma-separated CIDs from the display. "Add Similar Compounds/Substances" adds compounds/substances similar to the current ones. "Add Similar Conformers" adds compounds/substances whose 3D conformers are similar to the conformers of the current set.

Revise BioAssay Selection: "Select Active" selects the subset of BioAssays in which some of the selected compounds/substances are active. "Add Active" adds additional BioAssays in which some of the selected compounds/substances are active. Similarly, you can also "Select" or "Add" by Active Concentration (IC50, etc). You can also remove a list of comma-separated AIDs from the display. "Add Related BioAssays" has four pop-up options: "by Target Similarity", "by Activity Overlap", "by Depositor", and "by BioSystems". "Select by BioAssay Type" has four pop-up options: "Summary/Confirmatory", "Summary", "Confirmatory", and "Primary Screening". "Defined Protein Target" shows only those BioAssays with protein targets defined. "Defined BioSystems" shows only those BioAssays with BioSystems defined.

Counts: The "Counts" only appear when users launch the Heatmap for the first time. The counts include the number of "Input" compounds or BioAssays and the number of compounds or BioAssays "Shown" in the Heatmap.

Export:

Data Table: Users can export "BioActivity Data", "Compound Similarity Scores", and "BioAssay Similarity Scores". "BioActivity Data" shows the data corresponding to each cell in the heatmap-style display. There are three kinds of data: Activity Outcome, Score, and Active Concentration (if available). "Compound Similarity Scores" and "BioAssay Similarity Scores" show the similarity scores used to generate the dendrograms.

Conformers Used: Each compound could have up to 10 conformers in the 3D similarity calculation. The most similar conformer pair is used to represent the 3D similarity of a pair of compounds. You can export these conformer pairs.

Image: Users can export the display in one full PNG image since the display may consist of many small images.

Clusters in GML: Users can export the clusters as a Graph Modelling Language (GML) file, which can be viewed in other software such as Cytoscape. The GML file format can be easily converted to other formats such as the eXtensible Graph Markup and Modeling Language (XGMML), Graph eXchange Language (GXL), and GraphML.

Result Display Option: is defined above.
Save View: is defined below.

Select a region in Heatmap: One way to show a subset of the Heatmap is to click two diagonal points in the Heatmap to select compounds and BioAssays in the region as shown in the image below. A menu will pop up with six options. The first option "Zoom in" displays a new Heatmap with compounds and BioAssays in the selected region. The second option "BioActivity Summary, Selected Compounds and BioAssays" shows the selected compounds and BioAssays in PubChem BioActivity Summary page. The third option "BioAssay Data Table, Selected Compounds and BioAssays" shows the selected compounds and BioAssays in PubChem Data Table page. The fourth option "Selected Compounds in Structure Clustering" shows the selected compounds in PubChem Structure Clustering page with all structures displayed. The last two options "Selected Compounds in Entrez" and "Selected BioAssays in Entrez" show the selected compounds or BioAssays in Entrez.

Click Blue Circles in Clusters: As shown in the following two images, if you click on a blue circle or the line above the circle in the Compound Cluster or BioAssay Cluster, a menu will pop up. The options for the Compound Cluster are "Compounds in Structure Clustering", "Compounds in Entrez", "Compounds in BioActivity Summary", "2D/3D Structure Similarity Scores" or "Activity Similarity Scores", "Expand Subtree", and four Revise Selections: "Display Subtree Only", "Remove Subtree & Display the Rest", "Add Similar Compounds", and "Add Similar Conformers". There are two more options for 3D Structure similarity: "Compounds in 3D Viewer" and "Conformers Used".

The options for the BioAssay Cluster Tree are "BioAssays in Entrez", "BioAssays in BioActivity Summary", "Similarity Scores" ("Activity Similarity Scores", "Target Sequence Identities", "Depositor-specified similarity", or "BioSystems similarity"), and six Revise Selections: "Display Subtree Only", "Remove Subtree & Display the Rest", "Add Related BioAssays, by Activity Overlap", "Add Related BioAssays, by Target Similarity", "Add Related BioAssays, by Depositor", and "Add Related BioAssays, by BioSystems".



Common Substructures: As shown in the following image, if you hover the mouse over a node (blue circle) or the line on its left, the common substructures for the compounds in the subcluster will pop up. Currently only the 2D common substructures are shown. If the similarity of the node is >= 0.9, the common fingerprint bits greater than 574 are shown. Otherwise, the fingerprint bits greater than 713 are shown.

Collapse Compound Cluster Tree: The Compound Cluster Tree can be collapsed if users click on the ruler as shown above. The subtrees beyond the collapsed Tanimoto score will be collapsed into a node, which can be expanded. The corresponding rows in the Heatmap are collapsed as well. The color of collapsed cells is the mixture of green and yellow.


Related BioAssays by Target Similarity:

This page shows the related BioAssays based on target similarity between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.

Protein Target: the protein(s) which the compounds interact with in the BioAssay.

Target Similarity: the similarity of the protein sequences for a pair of targets in two BioAssays. Both Sequence Identity and Blast E-value are shown. The related BioAssays are sorted by Sequence Identity.

Sequence Alignment: the sequences of protein targets in the original BioAssay and the Related BioAssay are aligned using Blast 2. If there are multiple targets in one BioAssay, only the target with the highest similarity is shown.


Related BioAssays by Activity Overlap:

This page shows the related BioAssays based on activity overlap between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.

Activity Similarity: For BioAssays A and B, the activity similarity = [Active in Both A and B] / ([Active in A] + [Active in B] - [Active in Both A and B]).
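
The activity similarity formula above can be sketched in Python as follows. The function name and the example CID sets are made up for illustration, and returning 0.0 when neither BioAssay has any active compound is an assumption of this sketch, since the formula itself is undefined in that case.

```python
def activity_similarity(active_in_a, active_in_b):
    """[Active in Both] / ([Active in A] + [Active in B] - [Active in Both])."""
    a, b = set(active_in_a), set(active_in_b)
    both = len(a & b)
    denominator = len(a) + len(b) - both
    # Assumption: report 0.0 when no compound is active in either BioAssay.
    return both / denominator if denominator else 0.0


# Hypothetical sets of CIDs that are active in BioAssays A and B.
actives_a = {2244, 3672, 5090}
actives_b = {2244, 5090, 1983, 60823}
print(activity_similarity(actives_a, actives_b))  # 2 / (3 + 4 - 2) = 0.4
```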

Active in Both: Links to compounds active in both the queried BioAssay and the BioAssay listed in the current row.


Related BioAssays by Depositor:

This page shows the related BioAssays specified by depositors. The top 10 related BioAssays are pre-selected.



Related BioAssays by Common BioSystems:

This page shows the related BioAssays based on common BioSystems between the queried BioAssay and all other BioAssays. The top 10 related BioAssays are pre-selected.

BioSystems: biological pathways and other BioSystems containing the target sequences of BioAssays. There are two kinds of BioSystems: organism-specific and across-species.


BioAssay View File:

A BioAssay view file enables users to save the state of a BioAssay display so that they may view it again at a later date or share it with colleagues. Please note that PubChem data may change over time as depositors add, update, and delete data. As such, saving a view does not absolutely guarantee that exactly the same information will be displayed at a later date. The BioAssay view file is in XML format. The specification for this file can be found at: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pug.xsd.


PubChem Cross Links

back to top

PubChem provides cross links to other databases when that information is available. You can find those links from either Entrez PubChem pages or individual record summary pages. These links are reciprocal: other databases also link back to PubChem. The links work well for a single ID input, e.g., the literature about "aspirin" (CID 2244) can be found using the url https://www.ncbi.nlm.nih.gov/pubmed?LinkName=pccompound_pubmed_mesh&from_uid=2244. Some links were removed in October 2016. Click here to see the list.
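
A link URL of the form shown above (a target database path plus LinkName and from_uid query parameters) can be assembled with a small Python sketch. The helper name entrez_link_url is illustrative; only the aspirin example is taken from this document, and whether other link names follow exactly the same pattern is an assumption.

```python
from urllib.parse import urlencode


def entrez_link_url(target_db, link_name, uid):
    """Build a cross-link URL for a single record ID.

    target_db: database the link points to (e.g. "pubmed")
    link_name: one of the link names listed in this section
    uid:       the source record ID (a CID, SID, or AID)
    """
    query = urlencode({"LinkName": link_name, "from_uid": uid})
    return f"https://www.ncbi.nlm.nih.gov/{target_db}?{query}"


# Literature about aspirin (CID 2244), as in the example above.
print(entrez_link_url("pubmed", "pccompound_pubmed_mesh", 2244))
```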

Links in PubChem Compound:
pccompound_biosystems
pccompound_gene
pccompound_mesh
pccompound_nuccore
pccompound_omim
pccompound_pcassay
pccompound_pcassay_active
pccompound_pcassay_activityconcmicromolar
pccompound_pcassay_activityconcnanomolar
pccompound_pcassay_inactive
pccompound_pcassay_probe
pccompound_pccompound
pccompound_pccompound_3d
pccompound_pccompound_mixture
pccompound_pccompound_parent
pccompound_pccompound_parent_connectivity_pulldown
pccompound_pccompound_parent_isotopes_pulldown
pccompound_pccompound_parent_pulldown
pccompound_pccompound_parent_stereo_pulldown
pccompound_pccompound_parent_tautomer_pulldown
pccompound_pccompound_sameanytautomer_pulldown
pccompound_pccompound_sameconnectivity_pulldown
pccompound_pccompound_sameisotopic_pulldown
pccompound_pccompound_samestereochem_pulldown
pccompound_pcsubstance
pccompound_pcsubstance_same
pccompound_pmc
pccompound_protein
pccompound_pubmed
pccompound_pubmed_mesh
pccompound_pubmed_publisher
pccompound_structure
pccompound_taxonomy
Links in PubChem Substance:
pcsubstance_biosystems
pcsubstance_books
pcsubstance_gene
pcsubstance_nuccore
pcsubstance_omim
pcsubstance_pcassay
pcsubstance_pcassay_active
pcsubstance_pcassay_activityconcmicromolar
pcsubstance_pcassay_activityconcnanomolar
pcsubstance_pcassay_inactive
pcsubstance_pcassay_probe
pcsubstance_pccompound
pcsubstance_pccompound_same
pcsubstance_pmc
pcsubstance_probe
pcsubstance_protein
pcsubstance_pubmed
pcsubstance_pubmed_publisher
pcsubstance_structure
pcsubstance_taxonomy
Links in PubChem BioAssays:
pcassay_books_probe
pcassay_gene_alltarget_list
pcassay_gene_rnai
pcassay_gene_rnai_active
pcassay_gene_target
pcassay_nucleotide
pcassay_nucleotide_dna_target
pcassay_nucleotide_rna_target
pcassay_omim
pcassay_pcassay_activityneighbor_list
pcassay_pcassay_assay_project
pcassay_pcassay_common_gene_list
pcassay_pcassay_gene_interaction_list
pcassay_pcassay_neighbor_list
pcassay_pcassay_same_assay_project_list
pcassay_pcassay_same_publication_list
pcassay_pcassay_similar_publication_list
pcassay_pcassay_targetneighbor_list
pcassay_pccompound
pcassay_pccompound_active
pcassay_pccompound_activityconcmicromolar
pcassay_pccompound_activityconcnanomolar
pcassay_pccompound_inactive
pcassay_pccompound_probe
pcassay_pcsubstance
pcassay_pcsubstance_active
pcassay_pcsubstance_activityconcmicromolar
pcassay_pcsubstance_activityconcnanomolar
pcassay_pcsubstance_inactive
pcassay_pcsubstance_probe
pcassay_probe
pcassay_protein_target
pcassay_pubmed
pcassay_pubmed_major
pcassay_structure
pcassay_taxonomy

PubChem Indexes and Filters in Entrez

back to top

The PubChem index search is a very powerful tool within the Entrez system. Users can simply type search term(s) followed by the bracketed index field name. Then click the "Go" button.

Examples:


  Search for DTP/NCI's record with NSC#78:
      On the PubChem homepage or Entrez search page, enter "DTP/NCI[Sourcename], 78[objectid]" in
      the search box, then click the Go button.

  Search for all compounds containing gold:
      On the PubChem homepage or Entrez search page enter "Au[el]", and click the Go button.

  Search for all compounds with heavy atom count between 10 and 12:
      On the PubChem homepage or Entrez search page, choose 'Pccompound' database from the
      search dropdown list, enter "10:12[hac]", and click the Go button.

The following fields can be searched within Entrez PubChem databases (with field aliases in square brackets; pick one alias that's easily memorized in case multiple aliases are available). For integer/real number fields, the range search can be done as shown above. Some indices and filters were removed in October 2016. Click here to see the list.
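
The query syntax above (a term followed by a bracketed field alias, with "low:high" for ranges) can be generated programmatically, as in this Python sketch. The helper names are illustrative. The last function builds a request URL for NCBI's E-utilities ESearch service, which is not described in this document; it is included only as an assumption about how such a term might be submitted outside the web search box.

```python
from urllib.parse import urlencode


def field_term(value, alias):
    """Format a single-value term, e.g. Au[el]."""
    return f"{value}[{alias}]"


def range_term(low, high, alias):
    """Format a range term for an integer/real field, e.g. 10:12[hac]."""
    return f"{low}:{high}[{alias}]"


def esearch_url(db, term):
    """Assumed usage of the E-utilities ESearch endpoint with db and term."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': term})}"


print(field_term("Au", "el"))        # Au[el]
print(range_term(10, 12, "hac"))     # 10:12[hac]
print(esearch_url("pccompound", range_term(10, 12, "hac")))
```

The square brackets and colon are percent-encoded in the URL, so the term should be built first and encoded once at the end.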


PubChem Compound:

All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer represents the CID in the Pccompound database. By default, an integer without a field alias is recognized as a UID. Same as [CID].
Filter [Filter]: Limits the records. A number of filters are available to restrict the search to compounds with particular information. The specialized Filters in this database are:
ActiveAidCount [AC, ACNT]: Using this field, users can query for compounds that are active in a certain number of assays.
ActiveAidRatio [AAR]: The number of BioAssays in which a compound tested active divided by the number of BioAssays in which the compound was tested with any result. The ratio is between 0 and 1.
AtomChiralCount [ACC, ACCNT]: Total count of chiral atoms in a given compound, integer.
AtomChiralDefCount [ACDC, ACDCNT]: Total count of defined chiral atoms in a given compound, integer.
AtomChiralUndefCount [ACUC, ACUCNT]: Total count of undefined chiral atoms in a given compound, integer.
BondChiralCount [BCC, BCCNT]: Total count of chiral bonds in a given compound, integer.
BondChiralDefCount [BCDC, BCDCNT]: Total count of defined chiral bonds in a given compound, integer.
BondChiralUndefCount [BCUC, BCUCNT]: Total count of undefined chiral bonds in a given compound, integer.
CompleteSynonym [CSYN, CSYNO]: Compound's synonyms, based on all substances related to this compound.
Complexity [CPLX]: Compound complexity.
CompoundID [CID]: Compound ID. Same as [UID].
CovalentUnitCount [CUC, CUCNT]: Integer.
CreateDate: Date this compound was created in PubChem.
Element [ELMT, EL]: Chemical element in a compound.
ExactMass [EMAS, EXMASS]: The calculated mass of an ion or a molecule containing the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT]: Count of non-hydrogen atoms in a compound, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT]: Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT]: Hydrogen bond donors for a compound, integer.
InChI [INCH, INCHI]: Standard IUPAC International Chemical Identifier. More info..
InChIKey [INCHIKEY]: Standard IUPAC International Chemical Identifier Key.


InChI strings and InChIKeys can be searched in the Entrez PubChem databases. For example:

To search with the InChIKey of aspirin, "BSYNRYMUTXBXSQ-UHFFFAOYSA-N":

Type or paste "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"[InChIKey] into the PubChem Compound, PubChem Substance, or Entrez Global search box, then click the Go button.

Note:
     The quote marks and the square brackets are required.
     'InChI=' is required when searching with an InChI string.
IsotopeAtomCount [IAC, IACNT]: Isotope atom numbers in a compound.
IUPACName [UPAC, IUPAC]: Standard IUPAC name for compound.
MeSHTerm [MSHT, MESHT]: Medical Subject Heading term. Note that MeSH entry terms (synonyms for the Medical Subject Heading term) are also indexed.
MolecularWeight [MW, MWT, MOLWT]: Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., Carbon has two natural isotopes 12 and 13 with relative abundances of 98.9% and 1.1% to yield an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS]: Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., Carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA]: MeSH pharmacological actions.
RotatableBondCount [RBC, RBCNT]: Count of rotatable bonds
SourceName [SRC, SRCNAM, SRCNAME]: Depositor name officially recorded in PubChem databases. For current data sources look here
SourceCategory [SRCC, SRCCAT, SRCCATG]: Depositor categories. For more information and possible categories look here
SubstanceID [SID]: Substance identifier, integer.
Synonym [SYNO]: Synonyms for substance.
TotalAidCount [TAC]: Count of all assays in which a compound was tested, covering active, inactive, inconclusive, and unspecified outcomes.
TotalFormalCharge [TFC, CHG, CHRG]: Total formal charge.
TPSA [TPSA]: Topological Polar Surface Area.
XLogP [XLGP, LOGP]
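
The difference between the MolecularWeight and MonoisotopicMass entries above can be checked with a short Python calculation for carbon. The abundances (98.9% and 1.1%) come from the MolecularWeight entry; the carbon-13 isotopic mass of 13.003355 g/mol is a standard reference value that this document does not state, so treat it as an assumption of the sketch.

```python
# (isotopic mass in g/mol, relative natural abundance) for carbon.
CARBON_ISOTOPES = [
    (12.000000, 0.989),  # carbon-12
    (13.003355, 0.011),  # carbon-13 (mass value assumed, see lead-in)
]


def average_mass(isotopes):
    """Abundance-weighted average mass, as used for MolecularWeight."""
    return sum(mass * abundance for mass, abundance in isotopes)


def monoisotopic_mass(isotopes):
    """Mass of the most abundant isotope, as used for MonoisotopicMass."""
    return max(isotopes, key=lambda pair: pair[1])[0]


print(round(average_mass(CARBON_ISOTOPES), 3))  # 12.011
print(monoisotopic_mass(CARBON_ISOTOPES))       # 12.0
```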



PubChem Substance:

All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer represents the SID in the Pcsubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
Filter [Filter]: Limits the records. A number of filters are available to restrict the search to substances with particular information. The specialized Filters in this database are:
AssaySourceName [ASRC, ASRCNAM, ASRCNAME]: Allows filtering by assay source name. For available data sources look here
Comment [CMT]: Substance or BioAssay comment.
CompleteSynonym [CSYN, CSYNO]: Compound's synonyms, based on all substances related to this compound.
ComponentCID [CCID]: Component compound identifier.
CompoundID [CID]: Compound identifier, integer.
DepositDate [DDAT, DEPDAT]: Deposition timestamp for a substance.
ModifyDate: Date this substance record was modified.
SourceCategory [SRCC, SRCCAT, SRCCATG]: Depositor categories. For more information and possible categories look here
SourceID [SRID, SRCID]: Depositor's external id.
SourceName [SRC, SRCNAM, SRCNAME]: Depositor name officially recorded in PubChem databases. For current data sources look here
SourceReleaseDate [SRD, SRDAT, RLSDAT]
StandardizedCID [SCID]: Standardized compound identifier, integer.
SubstanceID [SID]: Substance ID. Same as [UID].
Synonym [SYNO]: Synonyms for substance.
TotalAidCount [TAC]

 
PubChem BioAssay:

All [ALL]: All of the following fields are searched. If a string query is presented without a field alias, by default, [ALL] is searched.
Uid [UID]: The integer represents the AID in the Pcassay database. By default, an integer without a field alias is recognized as a UID.
Filter [Filter]: Limits the records. A number of filters are available to retrieve records in the same or other databases that the current BioAssay records are cross-referenced to.
ActiveSidCount [AC, ACNT]: Number of substances (identified by SID--substance identifier from Pcsubstance) that are considered active in a BioAssay.
Activity Outcome Method [ACMD]: Description of how the activity outcome is determined. Choices of search query include:
AssayComment [ACMT, ACMMNT]: Comment for a BioAssay provided by the depositor.
AssayDescription [ADES, ADESC, ADSC]: Description for the BioAssay.
AssayName [ANAM, ANAME]: Name of a BioAssay provided by depositor.
AssayProtocol [APRL, APRTL]: Protocol for a BioAssay provided by depositor.
AssaySourceID [ASRD, ASRID]: External assay source identifier.
DepositDate [DDAT, DDATE]: Date when BioAssay record is deposited into PubChem. Date format is yyyy/mm/dd. mm and dd are optional.
GrantNumber [GRN, GRNUM]: NIH Grant Numbers
ModifyDate [MDAT, MDATE]: Last date when a BioAssay data content is modified. Date format is yyyy/mm/dd. mm and dd are optional.
NucleicAcidReagentID [NARD, NARID]: NCBI Probe Database identifiers (ProbeDB ID) referred by BioAssay
PigGI [PIGI, PIGGI]: Identical sequence NCBI Protein GI number similar to a BioAssay target
ProbeCidCount [ACC, ACCNT]: Number of unique chemicals (identified by CID--compound identifiers from Pccompound) that are considered as probe in a BioAssay.
ProteinTargetGI [PTGI]: NCBI Protein GI number of a BioAssay protein target
ProteinTargetName [PTN]: NCBI Protein name of a BioAssay protein target
RNATargetGI [NARD]: NCBI Nucleotide GI number of a BioAssay nucleotide target
ReleaseDate [RDAT, RDATE]: Date when a BioAssay data is released to public by PubChem. Date format is yyyy/mm/dd. mm and dd are optional.
SourceCategory [SRCC, SRCCAT, SRCCATG]: Category of BioAssay data source
SourceName [SNME, SNAME]: Source name of a BioAssay data specified by depositor.
SynonymTested [SYNT]: MeSH names and synonyms that are associated with any chemical structure tested in a BioAssay.
TaxonomyName [TXNM, TXNAM, TXNAME]: NCBI Entrez Taxonomy name.
TotalSidCount [TSC]: Total number of substances tested in a BioAssay.



PubChem 3D

back to top

PubChem generates a theoretical 3D description of each compound in the PubChem Compound database that meets all of the following criteria:
  1. It is not too large (<= 50 non-hydrogen atoms).
  2. It is not too flexible (<= 15 rotatable bonds).
  3. It consists of only organic elements (H, C, N, O, F, P, S, Cl, Br, and I).
  4. It has only a single covalent unit (i.e., not a salt or a mixture).
  5. It contains only atom types recognized by the MMFF94s force field.
For more information, please visit PubChem 3D Release Notes.
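
The criteria listed above can be expressed as a simple check, sketched here in Python. The function and parameter names are illustrative, the input values would have to come from the compound's computed properties, and the MMFF94s atom-typing test is reduced to a caller-supplied flag because this sketch does not implement force-field typing.

```python
ORGANIC_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def eligible_for_3d(heavy_atom_count, rotatable_bond_count, elements,
                    covalent_unit_count, mmff94s_typed=True):
    """Return True if a compound meets the five criteria listed above."""
    return (
        heavy_atom_count <= 50                      # 1. not too large
        and rotatable_bond_count <= 15              # 2. not too flexible
        and set(elements) <= ORGANIC_ELEMENTS       # 3. organic elements only
        and covalent_unit_count == 1                # 4. single covalent unit
        and mmff94s_typed                           # 5. assumed, supplied by caller
    )


# Illustrative inputs: a small organic molecule and a two-component sodium salt.
print(eligible_for_3d(13, 3, {"C", "H", "O"}, 1))        # True
print(eligible_for_3d(14, 3, {"C", "H", "O", "Na"}, 2))  # False
```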

PubChem also provides 3D viewers as both a desktop application and a web-based interface.





PubChem FAQ

back to top

  1. What is PubChem ?

  2. What is PubChem Substance ?

  3. What is PubChem Compound ?

  4. What does the depositor's category tell users and what are the existing depositor categories ?

  5. Why search PubChem Substance and/or Compound ?

  6. What is PubChem BioAssay ?

  7. How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?

  8. How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?

  9. How does PubChem process my deposited structures ?

  10. How do I process a text search with PubChem databases  ?

  11. How do I perform a structure search ?

  12. How do I save my search result ?

  13. Sometime I see errors in the substance record, where should I report ?

  14. What are exact mass and monoisotopic mass for a substance/compound ?

  15. How do I find INCHI version and parameters?

  16. What is the legacy designation ?

Q: What is PubChem ?

A: PubChem is a component of NIH's Molecular Libraries Roadmap Initiative. It provides information on the biological activities of small molecules. PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical similarity search tool.

Q: What is PubChem Substance ?

A: PubChem Substance records contain substance information electronically submitted to PubChem by depositors. This includes any chemical structure information submitted, as well as chemical names, comments, and links to the depositor's web site.

Q: What is PubChem Compound ?

A: PubChem compound records comprise a non-redundant set of standardized and validated chemical structures. A compound record may link to more than one PubChem Substance record, if different depositors supplied the same structure. Chemical names shown in PubChem Compound records are a composite derived from all linked substances, with default ranking of names by weighted frequency of use.

Q: What does the depositor's category tell users and what are the existing depositor categories ?

A: The depositor categories indicate the type of information one may expect to find when following the depositor substance URL or the type of information provided by the depositor. Possible categories include the following:

Category: Meaning
Biological Properties: Depositor provides information about the biological properties of a substance or compound
Chemical Reactions: Depositor provides information about the reactivity, synthesis, or known reactions of a substance or compound
Imaging Agents: Depositor provides information about the contrast agent or imaging agent used in, for example, MRI
Journal Publishers: Depositor is a journal publisher and has articles published about a substance or compound
Metabolic Pathways: Depositor provides information on the metabolic pathways involving a substance or compound
Molecular Libraries Screening Center Network: Depositor is part of the NIH Molecular Libraries Screening Center Network (MLSCN)
NIH Substance Repository: Depositor is an NIH Molecular Libraries Small Molecule Repository servicing the MLSCN
Physical Properties: Depositor provides information about the experimental physical properties of a substance or compound
Protein 3D Structures: Depositor provides information about the experimental 3-D structure of a substance or compound
Substance Vendors: Depositor is a seller of a substance or compound
Theoretical Properties: Depositor provides information about the theoretical properties of a substance or compound
Toxicology: Depositor provides information about the toxicological properties of a substance or compound

Q: Why search PubChem Substance and/or Compound ?

A: It is useful to search PubChem's Substance database when one is looking for information from a particular depositor exclusively, and/or when one is looking for information on substances such as natural product extracts which may not have associated chemical structure information. These special cases aside, it is generally most useful to search for chemical names or structures in PubChem's Compound database. This provides a concise view, combining information derived from multiple Substance records that specified the same structure. PubChem's structure search service operates on PubChem's Compound database exclusively.

Q: What is PubChem BioAssay ?

A: The PubChem BioAssay Database contains BioActivity screens of chemical substances described in PubChem Substance.  It provides searchable descriptions of each BioAssay, including descriptions of the conditions and readouts specific to that screening procedure.

Q: How does PubChem assign Substance identifiers ? When a substance is revoked by the depositor, can I still see the old record ?

A: A PubChem Substance SID is assigned to each unique external registry ID provided by a PubChem data source. A depositor may "revoke" (or otherwise deprecate) a PubChem SID at any time for any reason. However, the link to the "revoked" PubChem SID lives on in perpetuity. There will be a message stating that the depositor deprecated the SID, but the link to the archived information will still be available. In addition, the PubChem CIDs pointed to by the old version of a PubChem SID at the time it was versioned or deprecated will also be available.

Q: How does PubChem assign Compound identifiers ? Will the structure represented by a CID ever change ?

A: A PubChem Compound CID is assigned to each unique chemical structure. Different tautomeric forms of the same compound may have different CIDs. The chemical structure represented by a CID is permanent. The URL links to the compound summaries are stable (always live), regardless of whether any substance points to them.

Q: How does PubChem process my deposited structures ?

A: The conversion of the deposited information goes through a series of validation steps (to confirm the structure is "valid") and then a series of standardization/normalization steps to remove VB redundancy.

The validation steps consist of:
  • Atom verification: do all atoms correspond to a known atomic element? E.g., "*" is not a known atom
  • Implicit hydrogens are assigned to organic elements using simple valence rules, e.g., methane "C" gets four implicit hydrogens assigned to it.
  • Functional group standardization: common incorrect and hypervalent representations of functional groups are "fixed", e.g., nitro groups represented by N(=O)=O become [N+](=O)[O-]
  • Atom valences are validated: do all atoms have an "allowed" valence? E.g., five bonds to carbon is not valid

    The standardization steps consist of:
  • Valence bond (VB) canonicalization: equivalent/alternate VB/tautomeric forms of a structure are normalized into a single representation
  • Aromaticity detection: structure aromaticity is detected and validated to be kekulizable
  • StereoChemistry detection: SP3 and SP2 stereo centers are detected and stereo-wedge placement standardized
  • Explicit hydrogen assignment: implicit hydrogens are converted to be explicit

    Subsequent additional processing includes 2D coordinate layout assignment.
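
    Three of the validation checks listed above (atom verification, implicit hydrogen assignment, and valence validation) can be sketched in a few lines of Python. The valence table and function names here are illustrative assumptions covering a handful of neutral organic elements; they are not PubChem's actual rules or code.

```python
# Toy table of default valences for a few neutral elements.
# This is an assumption for illustration, not PubChem's rule set.
DEFAULT_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def is_known_atom(symbol):
    """Atom verification: does the symbol name an element in the table?"""
    return symbol in DEFAULT_VALENCE


def implicit_hydrogens(symbol, bond_order_sum):
    """Implicit hydrogens: fill the remaining valence with hydrogens."""
    return max(DEFAULT_VALENCE[symbol] - bond_order_sum, 0)


def has_allowed_valence(symbol, bond_order_sum):
    """Valence validation: the bond orders must not exceed the default valence."""
    return bond_order_sum <= DEFAULT_VALENCE[symbol]


print(is_known_atom("*"))            # "*" is not a known atom
print(implicit_hydrogens("C", 0))    # methane carbon with no explicit bonds
print(has_allowed_valence("C", 5))   # five bonds to carbon
```

    A real pipeline would also account for formal charges and elements with several allowed valences, which this sketch ignores.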

    Q: How do I process a text search with PubChem databases ?

    A: PubChem's Substance, Compound, and BioAssay databases are fully integrated within NCBI's Entrez data retrieval system. You can process any name, keyword, or ID search through the Entrez system. The PubChem homepage also provides a search box. For a specific database query, see related content in the help document above.

    Q: How do I perform a structure search ?

    A: You can perform a structure search through the PubChem structure database. PubChem provides two search interfaces, basic structure search and advanced structure search. For more information, visit structure search help.

    Q: How do I save my search result ?

    A: To save your search from Entrez, you can use either the PubChem download facility or the Entrez generic search-save tool. You can get more help by clicking the two links above.

    Q: Sometimes I see errors in a substance record. Where should I report them ?

    A: PubChem does not have curators and never changes or edits substance records. They remain as supplied by our depositors, just as with GenBank records. From the PubChem substance summary page, you can follow the link to the original record on the depositor's site and report the error there. Once the depositor corrects the error, PubChem will pick up the change at the next update. Errors in compound properties or descriptors can be reported to the NCBI help desk.

    Q: What are exact mass and monoisotopic mass for a substance/compound ?

    A: The exact mass is the mass of the most likely isotopic composition for a single random molecule; it corresponds to the most intense ion/molecule peak in a mass spectrum. The monoisotopic mass is the mass of a molecule calculated using the most abundant isotope of each element.
    For example, carbon has a monoisotopic mass of 12.000 g/mol.
    Exact mass and monoisotopic mass are the same for more than 90% of structures but differ when atom counts are such that presence of one or more lower abundance isotopes is most probable.

    For example, carbon tetrachloride, CCl4, PubChem CID 5943, has an exact mass of 153.872; here the most probable composition is three 35Cl, one 37Cl, and one 12C. Its monoisotopic mass is 151.875, where all chlorine atoms are assumed to be 35Cl (isotope abundance 75.77%) and the carbon atom is assumed to be 12C (isotope abundance 98.9%). The two masses are identical in most cases. They differ for compounds with four or more Cl atoms, two or more Br atoms, or other elements not dominated by a single isotope, and for very large compounds, such as those with more than 99 carbon atoms: for C100, the exact mass is 12C * 99 + 13C * 1, while the monoisotopic mass is 12C * 100.
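
    The carbon tetrachloride and C100 cases can be reproduced with a short Python sketch. It assumes only two isotopes per element, treats each element independently with a binomial distribution, and uses the abundances quoted above together with standard isotope masses; the table and function names are illustrative and are not how PubChem computes these values.

```python
from math import comb

# (light isotope mass, heavy isotope mass, abundance of the light isotope).
# Two-isotope model; abundances are the ones quoted in the text above.
ISOTOPES = {
    "C": (12.000000, 13.003355, 0.989),
    "Cl": (34.968853, 36.965903, 0.7577),
}


def most_likely_heavy_count(n, p_light):
    """Number of heavy isotopes with the highest binomial probability among n atoms."""
    p_heavy = 1.0 - p_light
    return max(
        range(n + 1),
        key=lambda k: comb(n, k) * p_heavy**k * p_light ** (n - k),
    )


def monoisotopic_mass(formula):
    """Mass using only the most abundant (here: light) isotope of each element."""
    return sum(ISOTOPES[element][0] * count for element, count in formula.items())


def exact_mass(formula):
    """Mass of the most probable isotopic composition, as defined in this FAQ."""
    total = 0.0
    for element, count in formula.items():
        light, heavy, p_light = ISOTOPES[element]
        k = most_likely_heavy_count(count, p_light)
        total += light * (count - k) + heavy * k
    return total


ccl4 = {"C": 1, "Cl": 4}
print(round(exact_mass(ccl4), 3))         # 153.872
print(round(monoisotopic_mass(ccl4), 3))  # 151.875
```

    With the same table, `most_likely_heavy_count(100, 0.989)` returns 1, which matches the C100 case: one 13C atom is more probable than none.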

    Q: How do I find the InChI version and parameters?

    A: The InChI version and parameters are detailed in the ASN.1 and XML data for each compound. For example, for aspirin, you can find the information you are seeking by viewing the ASN.1 record:

    https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=2244&disopt=DisplayASN1

    If you scroll down to the InChI property record, you will find:
    {
       urn {
          label "InChI",
          datatype string,
          parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
          implementation "E_INCHI",
          version "1.0.1",
          software "InChI",
          source "nist.gov",
          release "2007.09.04"
       },
       value sval "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"
    },
    The InChI version is (currently) 1.0.1. We are using the parameter options "auxnonr donotaddh w0 fixedh recmet newps".
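
    If you have saved the ASN.1 text, the version and parameter options can also be pulled out programmatically. The following Python sketch applies regular expressions to the fragment shown above; the function name is hypothetical, and a regex is only a rough stand-in for a real ASN.1 parser.

```python
import re

# The urn block from the aspirin record shown above, pasted as plain text.
ASN1_SNIPPET = '''
urn {
   label "InChI",
   datatype string,
   parameters "options {auxnonr donotaddh w0 fixedh recmet newps}",
   implementation "E_INCHI",
   version "1.0.1",
   software "InChI",
   source "nist.gov",
   release "2007.09.04"
},
'''


def urn_fields(text):
    """Collect every `key "value"` pair found in an ASN.1 text fragment."""
    return dict(re.findall(r'(\w+)\s+"([^"]*)"', text))


fields = urn_fields(ASN1_SNIPPET)

# The options are the space-separated words inside the braces.
options = re.search(r"\{([^}]*)\}", fields["parameters"]).group(1).split()

print(fields["version"])
print(options)
```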

    Q: What is the legacy designation ?

    A: PubChem uses a "legacy" designation to give users the option to filter collections that are not regularly updated. For more information, please see our legacy designation help page.


  • PubChem Courses

    PubChem training materials, including slides and exercises, are available as part of A Librarian's Guide to NCBI. The last day of that five-day program includes coverage of Drugs and other small bioactive molecules (slides, exercises).

    NCBI has also provided PubChem training courses in the past. Although these courses have been superseded by the newer Discovery Workshops (accessible from the NCBI Education page), the PubChem course materials are still available and helpful in understanding the PubChem resources.

    PubChem Documents


    Direct Link Services

    PubChem services provide direct-link URLs that let users retrieve data by valid ID.
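
    As one illustration, the ASN.1 display link quoted in the FAQ for aspirin (CID 2244) can be assembled from a CID in Python. The helper name is hypothetical, the URL pattern is taken only from that single FAQ example, and the sketch does not check whether the endpoint is still served.

```python
from urllib.parse import urlencode

# Base address copied from the aspirin example in the FAQ.
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi"


def asn1_display_url(cid):
    """Build the ASN.1 display link for a compound identifier (hypothetical helper)."""
    if not isinstance(cid, int) or cid <= 0:
        raise ValueError("CID must be a positive integer")
    return BASE_URL + "?" + urlencode({"cid": cid, "disopt": "DisplayASN1"})


print(asn1_display_url(2244))
```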

    Glossary

    AID: PubChem's BioAssay (protocol) identifier, a non-zero integer.

    BioActivity Types:

    CID: PubChem's compound identifier, a non-zero integer for a unique chemical structure.

    Complexity: The complexity rating of a compound is a rough estimate of how complicated its structure is, seen from the point of view of both the elements contained and the displayed structural features, including symmetry. Neither stereochemistry nor isotope labeling is used as an auxiliary criterion. The value is computed using the Bertz/Hendrickson/Ihlenfeldt formula. A scaling factor for aromaticity is used so that the complexity of benzene is the same as that of cyclohexane. It is a floating point value, ranging from 0 (simple ions) to several thousand (complex natural products). Generally, larger compounds are more complex than smaller ones, but highly symmetrical compounds, or compounds with few distinct atom types or elements, are downgraded. Complexity is only loosely correlated with synthetic accessibility. The most complex compound in PubChem is CID 6338588 (C124H185N9O207S36) with a complexity rating of about 18425. The average complexity of the structures in the PubChem compound database is about 551.

    Comments: Lists all of the depositor's comments and additional information for this substance.

    Component: For a mixture substance/compound, a component is one of its individual molecules.

    Compound: Chemical representatives in substances. Chemical structure presented in a compound is standardized through PubChem's data pipeline. A mixture substance may have several standardized compounds. A compound record is structurally unique in the PubChem compound database.

    Computed Descriptors: Information to describe the compound in different formats, including SMILES, InChI, IUPAC names.

    Computed Properties: These data are calculated from the compound, including molecular weight, formula, XLogP, etc.

    Depositors Category: Indicates that additional category-specific information is available either on the depositor's substance summary page or on the depositor's website.

    Deprecated Compound: A Compound CID which has no links to any substance. This may occur as PubChem modifies processing. A deprecated compound will not be available within Entrez.

    HBA: Number of hydrogen-bond acceptors in the structure. Classification follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].

    HBD: Number of hydrogen-bond donors in the structure. Classification follows [J. Chem. Inf. Comput. Sci. 1997, 37, 615-621].

    Heavy Atom: All atoms except hydrogen.

    InChI: IUPAC International Chemical Identifier. Learn more...  InChI string can be searched through the Entrez PubChem databases. Click here to see the example.

    Old Version Substance: Substance versions are considered to be "old" when a more recent update is provided by the depositor.

    Molecular Formula: A way of expressing information about the atoms that constitute a particular chemical molecule.

    Molecular Weight: The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labeling, averaged natural abundance (which may, for example, in the case of Li and U compounds, not be identical to purchasable material) is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location, even for short-lived radioactive isotopes where this is often physically unrealistic. At this moment, it is not possible to deposit more detailed isotope composition information into the PubChem database. Pseudo-atoms which are not an element have an atomic weight of 0 g/mol.
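
    The rules in this definition can be sketched in Python as follows. The small weight tables hold standard values for a few elements only, every symbol missing from the table is treated as a pseudo-atom, and the function names are illustrative assumptions rather than PubChem's implementation.

```python
# Average atomic weights (g/mol) at natural abundance, for a few elements only.
AVERAGE_WEIGHT = {"H": 1.008, "C": 12.011, "O": 15.999}

# Masses used when an atom carries an explicit isotope label (100% purity assumed).
ISOTOPE_MASS = {("H", 2): 2.014102, ("C", 13): 13.003355}


def atom_weight(symbol, isotope=None):
    """Weight of one atom: labeled isotope mass, average weight, or 0 for pseudo-atoms."""
    if isotope is not None:
        return ISOTOPE_MASS[(symbol, isotope)]
    # Assumption: any symbol not in the table is a pseudo-atom and weighs 0 g/mol.
    return AVERAGE_WEIGHT.get(symbol, 0.0)


def molecular_weight(atoms):
    """Sum the weights of (symbol, isotope or None, count) entries."""
    return sum(atom_weight(symbol, isotope) * count for symbol, isotope, count in atoms)


# Aspirin, C9H8O4, with no isotope labels.
aspirin = [("C", None, 9), ("H", None, 8), ("O", None, 4)]
print(round(molecular_weight(aspirin), 2))  # 180.16
```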

    Revoked BioAssay: When a depositor removes an assay that the depositor previously deposited into PubChem, the assay is considered revoked. A revoked assay will not be available within Entrez.

    Revoked Substance: When a depositor removes a substance from their substance collection, the substance is considered revoked. A revoked substance will not be available within Entrez.

    SID: PubChem's substance identifier, a non-zero integer for a deposited substance.

    SMILES: Simplified Molecular Input Line Entry System, a line notation (a typographical method using printable characters) for entering and representing molecules. Learn more..
    You can also find more related information from PubChem's document section in PDF or Text.

    SMARTS: A language that allows you to specify substructures using rules that are straightforward extensions of SMILES. Learn more..

    Stereochemistry: Relative spatial arrangement of atoms within molecules, such as chirality.

    Substance: An individual record collected from a depositor, representing a sample used in a BioAssay.

    Substance Category: Substance categories (one or more) are assigned to each depositor, based on nature of that depositor's institution and the type of data they supply.

    Suppressed Compound: A Compound CID that links only to an old version substance. A suppressed compound will not be available within Entrez.

    Synonyms: All names, trivial names, synonyms, frequently used IDs, and other names collected from depositors. In the compound summary page, synonyms are distinct synonyms from all corresponding substances.

    TPSA: Topological Polar Surface Area. This is an estimate of the polar surface area (in Å²). The implementation follows [J. Med. Chem. 2000, 43, 3714-3717]. It is a simple method: only N and O are considered, 3D coordinates are not used, and there are various precomputed factors for different hybridizations, charges, and participation in aromatic systems.

    Version: PubChem substance version number is incremented when an update is provided by the depositor.

    Xref: The external references/links to PubChem database records.

    XLogP: A partition coefficient or distribution coefficient that is a measure of differential solubility of a compound in two solvents. Learn more..
    Since February 2009, PubChem has used version 3 of the algorithm to generate the XLogP value [J. Chem. Inf. Model. 2007, 47, 2140-2148]. You can also visit the XLogP3 website: http://www.sioc-ccbg.ac.cn/software/xlogp3/.