A Categorical Structure-Activity Relationship Analysis of the Developmental Toxicity of Antithyroid Drugs

The choice of therapeutic strategies for hyperthyroidism during pregnancy is limited. Surgery and radioiodine are typically avoided, leaving propylthiouracil and methimazole as the options in the US. Carbimazole, a metabolic precursor of methimazole, is available in some countries outside the US. In the US, propylthiouracil is recommended because of concern about developmental toxicity from methimazole and carbimazole. Despite this recommendation, the data on the developmental toxicity of all three agents are extremely limited and insufficient to support such a policy, given the broad use of methimazole and carbimazole around the world. In the absence of new human or animal data, we describe the development of a new structure-activity relationship (SAR) model for developmental toxicity using the cat-SAR expert system. The SAR model was developed from data for 323 compounds evaluated for human developmental toxicity, of which 130 were categorized as developmental toxicants and 193 as nontoxicants. Model cross-validation yielded concordance between observed and predicted results of 79% to 81%. Based on this model, propylthiouracil, methimazole, and carbimazole were found to share some structural features associated with human developmental toxicity. Thus, given the need to treat women with Graves' disease during pregnancy, new molecules with minimized risk for developmental toxicity are needed. To help meet this challenge, the cat-SAR method could be useful in screening new drug candidates for developmental toxicity as well as in investigating their mechanisms of action.


Introduction
Hyperthyroidism occurs in approximately 2 of every 1000 pregnancies, with the majority of cases being Graves' disease [1], an autoimmune disorder caused by TSH receptor-stimulating autoantibodies. Propylthiouracil (PTU) is the primary treatment of hyperthyroidism during pregnancy in the US, followed by methimazole (MMI); carbimazole (CMI) is not distributed in the US. Surgery and radioiodine are not recommended treatment modalities during pregnancy. Untreated hyperthyroidism during pregnancy leads to developmental toxicity, including spontaneous abortion, prematurity, growth restriction, and fetal death [1,2]. It is not clear whether untreated hyperthyroidism during pregnancy leads to structural malformations [3,4]; however, one dataset suggested that the risk of malformations was greater among infants from untreated pregnancies than among those from treated pregnancies [5]. Additionally, there is concern that one of the medical therapies, MMI, may be a weak developmental toxicant that produces structural malformations [6] as well as fetal goiter. Part of this concern stems from the observation that CMI, one of whose metabolites is MMI, is a weak developmental toxicant.
On the one hand, as just mentioned, treatment of hyperthyroidism during pregnancy may result in developmental toxicity, because the available drugs cross the placenta and can cause fetal goiter. On the other hand, although the data remain unclear, there is evidence suggesting that maternal hypothyroidism, which can result from overtreatment, is associated with impaired fetal neurodevelopment. Consequently, the clinician must balance the use of antithyroid medications against the potential developmental consequences of inadequate or overly aggressive therapy, with a limited set of therapeutic options.

International Journal of Pediatric Endocrinology
Several assumptions about PTU and MMI have guided the medical treatment of Graves' disease during pregnancy in the US [2,6-8]: that placental transport of PTU is lower than that of MMI; that, because of this lower placental transport, the fetal and neonatal thyroid effects of PTU are smaller than those of MMI; that MMI exposure during pregnancy impairs neurocognitive development more than PTU exposure does; and that MMI produces structural defects whereas PTU does not. A thoughtful review and analysis of the literature by Mandel has demonstrated that these assumptions are not correct [2,6].
Given the rarity of Graves' disease in pregnancy and the weak link between the antithyroid medications and malformations, we have taken a different approach to assessing the potential developmental toxicity of the drugs used to treat the disease in pregnancy. Several years ago we created datasets to evaluate the structural determinants of developmental toxicity in experimental animals and humans [9-13]. Statistical analyses demonstrated that animal models are reasonable predictors of human developmental toxicity [14], and that experts in developmental toxicity could agree upon rules for evaluating animal and human data [9]. Subsequently, we evaluated the utility of structure-activity relationship (SAR) models generated by MultiCASE for studying and predicting developmental toxicity in diverse species, including humans [12]. This latter dataset of chemicals assessed for human developmental toxicity has now been used to create a more transparent and robust model of developmental toxicity with the categorical-SAR (cat-SAR) expert system.
Briefly, the cat-SAR expert system differs from other SAR expert systems in that it gives the user a high degree of flexibility in both learning set development and model parameterization [15]. Cat-SAR analysis allows the user to specify adjustable modeling attributes, including the size of the 2-dimensional fragments, whether or not to include hydrogen atoms in the analysis, and the rules for identifying important fragments for the final model. Hence, control over which compounds are included in the learning set and over the various model attributes allows the user to rigorously explore the relationships between chemical structure and biological activity. Application of the cat-SAR expert system to a toxicological or pharmacological endpoint is thus not constrained by the need to fit a given dataset to the attributes of a predefined, and often proprietary, modeling process.

Human Developmental Toxicity
Dataset. Data on human developmental toxicity were derived from the Teratogen Information System (TERIS) and a database that used the US FDA guidelines, as described previously [12]. The chemicals in this database were specifically characterized with respect to risk for human developmental toxicity, including death, growth retardation, and functional and structural abnormalities.

Cat-SAR Structure Activity Relationship Expert System.
The cat-SAR approach is a computational SAR or in silico toxicity analysis and prediction "expert system." In previous analyses, the cat-SAR program was able to achieve an overall concordance between observed and predicted values of 92% for a set of chemicals assessed for their ability to induce respiratory hypersensitivity [16], 80%-90% for a set of environmental estrogen mimics [17], and 78%-84% for a set of rat mammary carcinogens [15].
Cat-SAR models are built by comparing the structural features found among categorized compounds in the model's learning set. Typically, these categories are toxicologically active and inactive compounds. The cat-SAR approach is transparent in the development of the learning set, the identification of fragments, and the determination of which fragments are significant or important. Moreover, the approach allows user intervention and model optimization throughout the modeling process, including the ability to examine the entire fragment base and to explore and optimize the fragments that have perceived biological relevance.
Because cat-SAR analyzes categorical data and 2-dimensional fragments rather than intact chemicals, the program can examine noncongeneric datasets that are divided into categories of activity rather than by degree of potency, as in quantitative SAR (QSAR). Thus, unlike Hansch and comparative molecular field analysis (CoMFA) approaches, which require continuous data, cat-SAR identifies molecular attributes associated with biological activity by comparing the attributes of active (e.g., teratogenic) compounds with those of inactive (e.g., nonteratogenic) compounds. The resulting models and predictions can then be used, respectively, to examine structural features associated with teratogenicity and to predict the likelihood of teratogenic activity of untested compounds.
Overall, the cat-SAR models for developmental toxicity discussed herein demonstrate a high degree of predictivity and mechanistic interpretability, and they can be useful for screening new drug candidates for developmental toxicity as well as for investigating their therapeutic and toxic mechanisms of action.

Learning Set Development.
The cat-SAR models are built by comparing the structural features found among two designated categories of compounds in the model's learning set. As mentioned, for these analyses the categories were developmental toxicants and developmental nontoxicants. The cat-SAR learning set consists of each chemical's name, its structure as a MOL2 file, and its categorical designation (i.e., 1 for active and 0 for inactive). Typically, organic salts are included as the free base, and simple mixtures and technical-grade preparations may be included as the major or active component. Metals, metallo-organic compounds, polymers, and mixtures of unknown composition are not included.

In Silico Chemical Fragmentation and the Compound-Fragment Data Matrix.
Using the Tripos Sybyl HQSAR module, each chemical was fragmented in silico into all possible fragments meeting user-specified criteria. HQSAR allows the user to select attributes for fragment determination, including atom counts (i.e., the size of the fragments), bond types, atomic connections (i.e., the arrangement of atoms in the fragment), explicit hydrogen atoms, chirality, and hydrogen bond donor and acceptor groups. Fragments can be linear, branched, or cyclic moieties. For the developmental toxicity dataset, the models contained fragments of three to seven atoms and considered atoms, bond types, and atomic connections.
Upon completion of the fragmentation routine, a Sybyl HQSAR add-on is used to produce a compound-fragment data matrix in which the rows are intact chemicals and the columns are molecular fragments. Thus, each chemical's fragments are recorded along its row, and all chemicals that contain a given fragment are recorded down that fragment's column.
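The row and column layout of the compound-fragment matrix can be illustrated with a small Python sketch. It does not reproduce the HQSAR add-on: it assumes fragmentation has already been performed and that each compound is represented simply as a set of fragment identifiers. The function names, compound names, and fragment IDs are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_compound_fragment_matrix(
    fragments_by_compound: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (compound names, fragment IDs, 0/1 matrix).

    Row i is a compound, column j is a fragment, and cell [i][j] is 1
    when compound i contains fragment j.
    """
    compounds = sorted(fragments_by_compound)
    fragments = sorted(set().union(*fragments_by_compound.values()))
    column_of = {frag: j for j, frag in enumerate(fragments)}
    matrix = [[0] * len(fragments) for _ in compounds]
    for i, name in enumerate(compounds):
        for frag in fragments_by_compound[name]:
            matrix[i][column_of[frag]] = 1
    return compounds, fragments, matrix


def compounds_containing(fragment, compounds, fragments, matrix):
    """Read down one column: every compound that contains `fragment`."""
    j = fragments.index(fragment)
    return [name for name, row in zip(compounds, matrix) if row[j]]


# Toy input with hypothetical fragment IDs (not HQSAR output).
toy = {"compound_A": {"f1", "f2"}, "compound_B": {"f2", "f3"}}
rows, cols, matrix = build_compound_fragment_matrix(toy)
print(rows, cols, matrix)
print(compounds_containing("f2", rows, cols, matrix))
```

Reading a row gives all fragments of one chemical; reading a column gives all chemicals carrying one fragment, which is the information the later selection rules need.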
The HQSAR module is not used for statistical analysis or model development. Rather, the compound-fragment matrix is subsequently analyzed with the cat-SAR expert system to identify structural features associated with the categorized active and inactive compounds. The cat-SAR program, the human developmental toxicity database, and the compound-fragment matrix are available from the corresponding author.

Identifying "Important" Fragments of Activity and Inactivity.
A measure of each fragment's association with biological activity is next determined. To ascertain an association between chemical descriptors (i.e., fragments) and a chemical's activity (or inactivity), a set of rules is used to distinguish "important" from "unimportant" descriptors. The first selection rule (the Number Rule) is the number of chemicals in the learning set that possess a particular fragment. The second selection rule (the Proportion Rule) is the proportion of the chemicals possessing that fragment that are active (or inactive). Whereas previously published cat-SAR models required the user to select specific values for the Number and Proportion Rules, a new routine was implemented here to determine optimal values for both. For this exercise, the Number Rule was allowed to range from one to eight, and the initial value of the Proportion Rule was allowed to range from 0.50 to 0.95.
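The Number and Proportion Rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the cat-SAR code: a fragment is taken to be important for activity when it occurs in at least `number_rule` compounds and at least `proportion_rule` of those compounds are active, and symmetrically for inactivity. The function name and toy data are hypothetical. Because the optimization criterion of the new routine is not described here, the sketch only enumerates the parameter grid (with an assumed Proportion Rule step of 0.05) rather than choosing an optimum.

```python
from typing import Dict, List, Set, Tuple


def select_important_fragments(
    fragments_by_compound: Dict[str, Set[str]],
    activity: Dict[str, int],  # 1 = active (toxicant), 0 = inactive
    number_rule: int,
    proportion_rule: float,
) -> Tuple[Set[str], Set[str]]:
    """Return (fragments important for activity, fragments important for inactivity)."""
    n_with: Dict[str, int] = {}         # compounds containing each fragment
    n_active_with: Dict[str, int] = {}  # ...of which are active
    for compound, frags in fragments_by_compound.items():
        for frag in frags:
            n_with[frag] = n_with.get(frag, 0) + 1
            n_active_with[frag] = n_active_with.get(frag, 0) + activity[compound]

    active_frags: Set[str] = set()
    inactive_frags: Set[str] = set()
    for frag, n in n_with.items():
        if n < number_rule:  # Number Rule: too few compounds carry this fragment
            continue
        p_active = n_active_with[frag] / n
        # Proportion Rule. With proportion_rule == 0.5, a 50/50 fragment
        # lands in the active set; this tie-breaking is an assumption.
        if p_active >= proportion_rule:
            active_frags.add(frag)
        elif 1 - p_active >= proportion_rule:
            inactive_frags.add(frag)
    return active_frags, inactive_frags


# Parameter grid from the text: Number Rule 1..8, Proportion Rule 0.50..0.95
# (the 0.05 step is an assumed value).
grid: List[Tuple[int, float]] = [
    (n, round(0.50 + 0.05 * k, 2)) for n in range(1, 9) for k in range(10)
]

# Toy learning set with hypothetical compounds and fragments.
toy_frags = {"a1": {"x", "y"}, "a2": {"x", "y"}, "a3": {"x"}, "i1": {"y", "z"}, "i2": {"z"}}
toy_activity = {"a1": 1, "a2": 1, "a3": 1, "i1": 0, "i2": 0}
active, inactive = select_important_fragments(toy_frags, toy_activity, 2, 0.75)
print(active, inactive)
```

In the toy data, fragment `y` occurs in two actives and one inactive (67% active), so it passes the Number Rule but fails the Proportion Rule on both sides and is dropped as unimportant.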

Predicting Activity.
The resulting list of important fragments can then be used for mechanistic analysis or to predict the activity of an untested compound. In the latter case, the model determines which, if any, of the important fragments the test compound contains. If none are present, no prediction of activity is made for the compound (i.e., there is no default prediction). If one or more fragments are present, the number of active and inactive compounds containing each fragment is determined. The fragment sum (FragSum) method then calculates the average probability of activity across the active and inactive fragments contained in the compound, weighted by the number of active and inactive compounds from which each fragment was derived. For example, if a compound contains two fragments, one found in 10 learning set compounds of which 9 are active (90% active) and the other found in 3 compounds, all inactive (0% active), the compound will be predicted to have a probability of activity of 69% ((9 + 0)/(10 + 3) = 9/13, or 69%).
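The FragSum calculation in the example above amounts to pooling the active and total compound counts over the important fragments a test compound contains. The sketch below assumes each important fragment is stored with its (active count, total count) from the learning set; the function and fragment names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def fragsum_probability(
    test_fragments: Iterable[str],
    important: Dict[str, Tuple[int, int]],  # fragment -> (n_active, n_total)
) -> Optional[float]:
    """Pooled probability of activity over the important fragments present.

    Returns None when the compound contains no important fragment,
    mirroring the "no default prediction" rule.
    """
    present = [f for f in set(test_fragments) if f in important]
    if not present:
        return None
    n_active = sum(important[f][0] for f in present)
    n_total = sum(important[f][1] for f in present)
    return n_active / n_total


# Worked example from the text (fragment names are hypothetical):
# one fragment seen in 10 compounds, 9 of them active; another seen in
# 3 compounds, none of them active.
important = {"frag_A": (9, 10), "frag_B": (0, 3)}
p = fragsum_probability(["frag_A", "frag_B", "frag_not_in_model"], important)
print(f"{p:.0%}")  # 9/13 -> 69%
```

Pooling counts rather than averaging the two percentages (90% and 0%, which would give 45%) is what makes the result weighted by the number of compounds behind each fragment.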
As described, a cat-SAR prediction is based on two separate fragment sets (i.e., the active fragments and the inactive ones), and the predicted activity of a chemical is the weighted average probability of activity across all of the active and inactive learning set compounds that contribute fragments to its structure. Therefore, to assign compounds back to an active or inactive category (i.e., rather than reporting a probability of activity), the program identifies an optimal cutoff point that best separates predicted active and inactive compounds, based on the probabilities of activity derived from a model validation analysis [18]. Depending on the application of the model, this cutoff point can be adjusted to select a model with the best overall concordance (i.e., the most predictive model), one with equal sensitivity and specificity (i.e., a balanced model that does not over-predict active compounds at the cost of misclassifying inactive ones, or vice versa), or one with high sensitivity (i.e., a risk-averse model).
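One way to picture the cutoff selection is to scan candidate cutoffs over validation probabilities and score each one. The sketch below is an assumption about how such a search could work, not the procedure of [18]: a compound is called active when its probability is at or above the cutoff, the candidate cutoffs are simply the observed probabilities, and the minimum sensitivity of 0.9 for the risk-averse option is an arbitrary illustrative value. The helper names and the toy numbers are hypothetical.

```python
from typing import Sequence, Tuple


def rates(probs: Sequence[float], labels: Sequence[int], cutoff: float) -> Tuple[float, float, float]:
    """Concordance, sensitivity, specificity when p >= cutoff is called active."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return (tp + tn) / len(labels), tp / pos, tn / neg


def choose_cutoff(probs, labels, strategy="concordance", min_sensitivity=0.9):
    """Pick a cutoff from the observed probabilities (hypothetical helper)."""
    candidates = sorted(set(probs))
    if strategy == "concordance":  # most predictive model
        return max(candidates, key=lambda c: rates(probs, labels, c)[0])
    if strategy == "balanced":  # sensitivity as close as possible to specificity
        return min(candidates,
                   key=lambda c: abs(rates(probs, labels, c)[1] - rates(probs, labels, c)[2]))
    if strategy == "sensitive":  # risk-averse: highest cutoff meeting the floor
        # The lowest candidate calls everything active (sensitivity 1.0),
        # so this list is never empty when at least one active exists.
        ok = [c for c in candidates if rates(probs, labels, c)[1] >= min_sensitivity]
        return max(ok)
    raise ValueError(f"unknown strategy: {strategy}")


# Toy validation output (invented numbers for illustration only).
probs = [0.95, 0.80, 0.65, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
for s in ("concordance", "balanced", "sensitive"):
    c = choose_cutoff(probs, labels, s)
    print(s, c, rates(probs, labels, c))
```

On these toy values several cutoffs tie for the best concordance; `max` keeps the lowest such cutoff, which is itself an arbitrary tie-breaking choice.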

Model Validation.
Self-fit, leave-one-out (LOO), and multiple leave-many-out (LMO) cross-validations were conducted for each model (see Table 1). For the self-fit, a model was developed from the complete learning set of 323 compounds and used to predict the activity of each compound in the learning set, to determine the general robustness of the model. For the LOO validation, each chemical, one at a time, was removed from the learning set, and an n-1 model was derived. The activity of the removed chemical was then predicted using the n-1 model. Predicted versus experimental values for each chemical were then compared, and the model's overall concordance, sensitivity, and specificity were determined, where

$$\text{concordance} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP},$$

and TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. For the LMO validation, randomly selected sets of 5% of the chemicals were removed from the learning set, and an n-5% model was derived for each set. The activity of each removed chemical was then predicted using the n-5% model. Predicted versus experimental values for the chemicals in the left-out set were then compared, and the n-5% model's concordance, sensitivity, and specificity were determined. This was repeated 10 000 times to compute the model's average concordance, sensitivity, and specificity.
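The LOO and LMO procedures can be outlined in Python as below. The `fit_predict` callable is a hypothetical stand-in for building a cat-SAR model on the training compounds and predicting one held-out compound (returning None when no important fragment is present). Excluding unpredicted compounds from scoring, rounding the held-out set size from 5% of n, and averaging each per-repeat rate only over repeats where it is defined are all assumptions, not details reported for cat-SAR.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: build a model on the training indices and return
# the predicted class (1 or 0) for one held-out index, or None when the
# compound carries no important fragment (no default prediction).
FitPredict = Callable[[List[int], int], Optional[int]]
Rates = Tuple[Optional[float], Optional[float], Optional[float]]


def score(pairs: Sequence[Tuple[int, Optional[int]]]) -> Rates:
    """Concordance, sensitivity, specificity over (observed, predicted) pairs."""
    scored = [(y, p) for y, p in pairs if p is not None]  # skip "no prediction"
    tp = sum(1 for y, p in scored if y == 1 and p == 1)
    tn = sum(1 for y, p in scored if y == 0 and p == 0)
    pos = sum(1 for y, _ in scored if y == 1)
    neg = len(scored) - pos

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None  # undefined when the denominator is 0

    return ratio(tp + tn, len(scored)), ratio(tp, pos), ratio(tn, neg)


def leave_one_out(labels: Sequence[int], fit_predict: FitPredict) -> Rates:
    n = len(labels)
    pairs = [(labels[i], fit_predict([j for j in range(n) if j != i], i)) for i in range(n)]
    return score(pairs)


def leave_many_out(labels: Sequence[int], fit_predict: FitPredict,
                   fraction: float = 0.05, repeats: int = 10_000, seed: int = 0) -> Rates:
    rng = random.Random(seed)
    n = len(labels)
    k = max(1, round(fraction * n))  # e.g. 16 of 323 compounds
    per_repeat = []
    for _ in range(repeats):
        held_out = set(rng.sample(range(n), k))
        train = [j for j in range(n) if j not in held_out]
        per_repeat.append(score([(labels[i], fit_predict(train, i)) for i in held_out]))
    averages = []
    for m in range(3):  # average each rate over the repeats where it is defined
        values = [r[m] for r in per_repeat if r[m] is not None]
        averages.append(mean(values) if values else None)
    return averages[0], averages[1], averages[2]


# Toy stand-in for a cat-SAR model: predict the training-set majority class.
labels = [1, 1, 1, 0]


def majority(train: List[int], i: int) -> Optional[int]:
    ones = sum(labels[j] for j in train)
    return 1 if 2 * ones >= len(train) else 0


print(leave_one_out(labels, majority))
print(leave_many_out(labels, majority, repeats=100))
```

With the toy majority predictor, LOO calls every compound active, so sensitivity is 1.0 and specificity is 0.0; a real cat-SAR model would replace `majority` with fragment selection and FragSum prediction on the training indices.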

Model Analysis and Validation.
Two cat-SAR developmental toxicity models were produced. Model 1 was designed to have near-equal sensitivity and specificity, while Model 2 was designed to achieve the best overall concordance between observed and predicted values (Table 1). Overall, 26 401 unique chemical fragments of 3 to 7 non-hydrogen atoms were derived from the developmental toxicity database (Table 1). Of these, for Model 1, 1815 were associated with developmental toxicants and 2707 with nontoxicants (4522 significant fragments in total); for Model 2, 1027 were associated with developmental toxicants and 2413 with nontoxicants (3440 significant fragments in total).
The self-fit analysis yielded concordance between observed and predicted results of 94% (274/291) for the balanced sensitivity and specificity model (Model 1) and 99% (250/252) for the best-concordance model (Model 2). These high concordance rates indicate a robust model, in which the learning set contains sufficient structural information to distinguish between active and inactive compounds. The LOO cross-validation yielded a concordance of 79% for Model 1 and 81% for Model 2. The LMO cross-validation yielded an average concordance of 78% for Model 1 and 80% for Model 2. Because the LOO and LMO validation results are in near agreement, the cat-SAR developmental toxicity models are estimated to be approximately 80% accurate in predicting the developmental toxicity potential of chemicals not included in the learning set.
To better judge how well these two models performed in general, one can consider the "accuracy," or reproducibility, of a standard in vitro toxicological test. For instance, the US National Toxicology Program's (NTP) Salmonella mutagenicity database, which is derived from a standardized protocol, has been estimated to be about 85% reproducible [49]. Given that the concordance between observed and predicted values for human developmental toxicity is approximately 80%, the cat-SAR human developmental toxicity models appear to be nearly as predictive as the standardized in vitro assay data used to develop SAR models.

Analysis of Propylthiouracil, Methimazole, and Carbimazole.
For these analyses, the balanced sensitivity and specificity model (Model 1) was used to analyze PTU, MMI, and CMI. This model identified 13 fragments within the three antithyroid medications that relate to developmental toxicity (Figure 1). Nine of the 13 fragments were associated with developmental toxicants and 4 with developmental nontoxicants.
Fragments 2184, 2182, and 6241 are found in fluorouracil, for which human data on developmental toxicity are limited. However, the compound is known to interfere with DNA synthesis and has been shown to be a developmental toxicant in rats, mice, rabbits, hamsters, guinea pigs, and nonhuman primates ([19], TERIS MicroMedex accessed 06 July 2009).
Fragments 2184, 2182, and 6229 are found in a series of barbiturates used as sedative-hypnotics and anticonvulsants. The literature on the developmental toxicity of the sedative-hypnotic compounds is contradictory [19]. Amobarbital has been associated in several human studies with cardiovascular malformations, although no experimental animal studies appear to have been conducted [19]. Phenobarbital use as an anticonvulsant throughout pregnancy has been associated with facial and cardiovascular malformations in humans and experimental animals (TERIS MicroMedex accessed 06 July 2009). Mephobarbital is metabolized to phenobarbital and has been associated with an increased risk of facial malformations (TERIS MicroMedex accessed 06 July 2009). There are few animal or human data on metharbital, used as an anticonvulsant, or on butalbital, used as a sedative-hypnotic.
Fragments 6172 and 6206 are found in thioguanine, which is used as an antineoplastic and in the treatment of Crohn's disease. Thioguanine has been demonstrated to produce malformations in experimental animals and in humans ([19], TERIS MicroMedex accessed 06 July 2009). The remaining fragments associated with developmental toxicity, 9479, 9564, and 9565, are found in a series of tetracycline-like antibiotics. Tetracycline is identified as a developmental toxicant because of its incorporation into teeth and evidence of altered bone growth ([19], TERIS MicroMedex accessed 06 July 2009); because of their structural similarity and shared mechanism of action, the other tetracyclines are also considered to be developmental toxicants.
Four fragments, 6279, 6298, 6308, and 6332, were found in molecules thought not to be developmental toxicants. These molecules are cephalosporin antibiotics, and across this class of compounds there is evidence suggesting an absence of risk for developmental toxicity in both experimental animals and humans ([19], TERIS MicroMedex accessed 06 July 2009).
The three antithyroid drugs analyzed represent the limited choices available to the health care provider caring for a woman with Graves' disease during pregnancy. These are old drugs, all available as generics; however, despite their long history of use, there are few data on the developmental risks resulting from exposure during pregnancy. All three drugs can cross the placenta and cause fetal goiter. Structural malformations are weakly associated with MMI and CMI use during pregnancy. Thus, while PTU has been suggested to be without risk of developmental toxicity in humans, that assumption appears to reflect a shared attitude rather than the presence of data demonstrating developmental safety. An additional concern with the use of PTU is liver toxicity leading to liver failure and the need for transplantation [20]. Analysis of data submitted to the US Food and Drug Administration (FDA) suggests that children may be at increased risk for PTU-induced liver failure, and cases of fetal hepatic toxicity have been observed in pregnancies of women treated with PTU (FDA Adverse Event Reporting System data obtained through the Freedom of Information Act). Additionally, there have been cases of malformations in infants born to women treated with PTU, and structural malformations have been observed in the infants of women with Graves' disease who were not treated during pregnancy.

Conclusions
The choice of therapeutic strategies for hyperthyroidism during pregnancy is limited. Surgery and radioiodine are typically avoided, leaving PTU and MMI as the options in the US [21]. CMI, a metabolic precursor of MMI, is available in some countries outside the US. In the US, PTU is recommended because of concern about developmental toxicity from MMI [22] and CMI [23].
In summary, the three drugs available to treat Graves' disease in pregnancy all appear, based on structural analysis, to be capable of producing developmental toxicity. Given the need to treat women with Graves' disease during pregnancy, it is essential to develop new molecules with structural attributes that suppress thyroid function while minimizing the risk for developmental toxicity. Based on the results described herein, the cat-SAR method would be a useful approach for screening compounds for developmental toxicity as well as for investigating their therapeutic and potential toxic mechanisms of action.