

Creation of a Highly Detailed, Dynamic, Global Model and Map of Science

Kevin W. Boyack
SciTech Strategies, Inc., 8421 Manuel Cia Pl NE, Albuquerque, NM 87122. E-mail: [email protected]

Richard Klavans
SciTech Strategies, Inc., 2405 White Horse Road, Berwyn, PA 19312. E-mail: [email protected]

The majority of the effort in metrics research has addressed research evaluation. Far less research has addressed the unique problems of research planning. Models and maps of science that can address the detailed problems associated with research planning are needed. This article reports on the creation of an article-level model and map of science covering 16 years and nearly 20 million articles using cocitation-based techniques. The map is then used to define discipline-like structures consisting of natural groupings of articles and clusters of articles. This combination of detail and high-level structure can be used to address planning-related problems such as the identification of emerging topics and the identification of which areas of science and technology are innovative and which are simply persisting. In addition to presenting the model and map, several process improvements that result in more accurate structures are detailed, including a bibliographic coupling approach for assigning current papers to cocitation clusters and a sequential hybrid approach to producing visual maps from models.

Introduction

The majority of the effort in metrics (sciento-, biblio-, infor-, alt-) studies has been aimed at research evaluation (Martin, Nightingale, & Yegros-Yegros, 2012). Examples of evaluation-related topics include impact factors, the h-index and other related indexes, university rankings, and national-level science indicators. The 40+ year history of the use of document-based indicators for research evaluation includes publication of handbooks (Moed, Glänzel, & Schmoch, 2004), the introduction of new journals (e.g., Scientometrics, Journal of Informetrics), and several annual or biannual conferences (e.g., ISSI, Collnet, STI/ENID) specifically aimed at reporting on these activities. The evaluation of research using scientific and technical documents is a well-established area of research.

Far less effort in metrics research has focused on the unique problems of research planning. Planning-related questions (Börner, Boyack, Milojević, & Morris, 2012) that are asked by funders, administrators, and researchers are different from evaluation-related questions. As such, they require models different from those that are used for evaluation. For example, planning requires a model that can predict emerging topics in science and technology. Funders need models that can help them identify the most innovative and promising proposals and researchers. Administrators, particularly those in industry, need models to help them best allocate their internal research funds, including knowing which existing areas to cut. To allow detailed planning, document-based models of science and technology have to be highly granular and, though based on retrospective data, must be robust enough to allow forecasting. Overall, the technical requirements of a model that can be used for planning are unique. Research and development of such models is an underdeveloped area in metrics research.

This article reports on a cocitation-based model and map of science covering nearly 20 million articles over 16 years that has the potential to be used to answer planning-related questions. Although the model is similar in concept to one that has been previously reported (Klavans & Boyack, 2011), it differs in several significant aspects: it uses an improved current paper assignment process, it uses an improved map layout process, and it has been used to create article-level, discipline-like groupings to provide context for its detailed structure. In the balance of this article, we first review related work to provide context for the work reported here. We then detail the experiments that led to the improvements in our map-creation process. This is followed by introductions and characterizations of the model and map, along with a description of how the map was divided into discipline-like groupings. The article concludes with a brief discussion of how the map will be used in the future for a variety of analyses.

Received January 23, 2013; revised April 10, 2013; accepted April 11, 2013

© 2013 ASIS&T • Published online 20 November 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.22990

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 65(4):670–685, 2014

Background

Science mapping, when reduced to its most basic components, is a combination of classification and visualization. We assume that there is a structure to science, and then we create a representation of that structure by partitioning sets of documents (or journals, authors, grants, etc.) into different groups. This act of partitioning is the classification part of science mapping and typically takes the majority of the effort. The resulting classification system, along with some representation of the relationships between the partitions, can be thought of as a model of science inasmuch as it specifies structure. The visualization part of science mapping uses the classification and relationships as input and creates a visual representation (or map) of that model as output.

Mapping of scientific structure using data on scientific publications began not long after the introduction of the Institute for Scientific Information (ISI)'s citation indices in the 1950s. Since then, science mapping has been performed at a variety of scales and with a variety of data types. Science mapping studies have been roughly evenly split among document-, journal-, and author-based maps. Both text-based and citation-based methods have been used. Many of these different types of studies have been reviewed at intervals in the past (Börner, Chen, & Boyack, 2003; Morris & Martens, 2008; White & McCain, 1997). Each type of study is aimed at answering certain types of questions. For example, author-based maps are most often used to characterize the major topics within an area of science and to show the key researchers in those topic areas. Journal-based models and maps are often used to characterize discipline-level structures in science. Overlay maps based on journals can be used to answer high-level policy questions (Rafols, Porter, & Leydesdorff, 2010). However, more detailed questions, such as questions related to planning, require the use of document-level models and maps of science. The balance of this section thus focuses on document-level models and maps.

When it comes to mapping document sets, most studies have used local data sets. The term local is used here to denote a small set of topics or a small subset of the whole of science. Although these local studies have been used successfully to improve mapping techniques and to provide detailed information about the areas that they study, we prefer global mapping because of the increased context and accuracy that are enabled by the mapping of all of science (Klavans & Boyack, 2011).

The context for work presented here lies in the efforts undertaken since the 1970s to map all of science at the document level using citation-based techniques. The first map of worldwide science based on documents was created by Griffith, Small, Stonehill, and Dey (1974). Their map, based on cocitation analysis, contained 1,310 highly cited references in 115 clusters, showing the most highly cited areas in biomedicine, physics, and chemistry. Small continued generating document-level maps using cocitation analysis (Small, Sweeney, & Greenlee, 1985), typically using thresholds based on fractional citation counting that ended up keeping roughly (but not strictly) the top 1% of highly cited references by discipline. The mapping process and software created by Small at the ISI evolved to generate hierarchically nested maps with four levels. Small (1999) presented a four-level map based on nearly 130,000 highly cited references from papers published in 1995, which contained nearly 19,000 clusters at its lowest level. At roughly the same time, the Center for Research Planning (CRP) was creating similar maps for the private sector using similar thresholds and methods (Franklin & Johnston, 1988). One major difference is that CRP's maps used only one level of clustering rather than multiple levels.

The next major step in mapping all of science at the document level took place in the mid-2000s, when Klavans and Boyack (2006) created cocitation models of more than 700,000 reference papers and bibliographic coupling models of more than 700,000 current papers from the 2002 file-year of the combined Science and Social Sciences Citation Indices. Later, Boyack (2009) used bibliographic coupling to create a model and map of nearly 1,000,000 documents in 117,000 clusters from the 2003 citation indices. Through 2004, the citation indices from ISI were the only comprehensive data source that could be used for such maps. The introduction of the Scopus database in late 2004 provided another data source that could be used for comprehensive models and maps of science. Klavans and Boyack (2010) used Scopus data from 2007 to create a cocitation model of science comprising more than 2,000,000 reference papers assigned to 84,000 clusters. More than 5,600,000 citing papers from 2003 to 2007 were assigned to these cocitation clusters based on reference patterns.

All of the models and maps mentioned to this point have been static maps; that is, they were all created using data from a single year and were snapshot pictures of science at a single point in time. It is only recently that researchers have created maps of all of science that are longitudinal in nature. In the first of these, Klavans and Boyack (2011) extended their cocitation mapping approach by linking together nine annual models of science to generate a 9-year global map of science covering 10,360,000 papers from 2000 to 2008. The clusters in this model have been found to be highly recognizable to subject matter experts (Klavans, Boyack, & Small, 2012). The fact that these small partitions are recognizable units of science suggests that these structures are reasonable representations of the actual topics in science.

More recently, Waltman and van Eck (2012) at Leiden University's Centre for Science and Technology Studies (CWTS) clustered nearly 10,000,000 documents from the Web of Science (2001–2010) using direct citation and a modularity-based approach that is similar to their familiar VOS method. This new CWTS approach has advantages over other approaches: It can be used to generate a multilevel hierarchical clustering, and it can handle large document sets with relatively modest computational requirements. Although it has been used with direct citation similarities, there is no reason it could not be used with similarities generated from other methods, such as cocitation, bibliographic coupling, or even text-based or hybrid similarity measures.

Improvements in Cocitation Model and Map Creation

We have used cocitation analysis for our models of science for many years because of a combination of the following features:

1. Cocitation analysis generates clusters of sufficient accuracy (Boyack & Klavans, 2010).

2. Cocitation analysis allows fractional assignment of current papers, which allows clusters to be linked (within an annual model) through article overlaps (Klavans & Boyack, 2010).

3. On a conceptual level, cocitation analysis generates models in which the cognitive structures tend to be highly unstable (Garfield, Malin, & Small, 1978). Instability is a very useful characteristic of a model if it is to be used specifically for the early identification of emergent phenomena.

4. Reference papers (in cocitation clusters) are a natural and accurate basis for linking annual clusters into longitudinal structures (Klavans & Boyack, 2011) that allow exploration of the dynamics of science.

Although our modeling choices have been based on experimentation, it is also true that more feature space remains to be explored, and it is likely that more accurate and useful models could be generated through additional exploration of that space. Given that we expect our maps to be used to help answer planning-related questions, we continuously seek to improve our modeling and mapping processes to make them more accurate. This section reports on three efforts to improve the accuracy and usefulness of our cocitation models and maps.

Cocitation analysis consists of two high-level steps: (a) clustering of reference papers into cocitation clusters and (b) assigning of the current papers to those cocitation clusters. We have recently explored changes in each of these two steps in an attempt to increase the accuracy of our models. Our third effort to improve our processes addresses the step in which a visual map is created from a model.

Cocitation Thresholds

The first step in any cocitation clustering or cocitation analysis activity is to select the references that are to be clustered. As mentioned previously, early maps of science (e.g., Small et al., 1985) used high thresholds, resulting in maps containing approximately the top 1% of highly cited references by discipline. The reasoning behind using high thresholds is that the cocitation clusters would be focused on the hottest topics in science. We have used much less stringent thresholds because of our desire to model the average and cold science in addition to the hottest science.

The effect of cited reference thresholds on the accuracy of the resulting cocitation clusters has not been systematically investigated. We thus designed an experiment whose primary intent was to allow us to determine the approximate fraction of references that should be retained to maximize accuracy. In addition, because reference age is known to have an effect on structure (van Raan, 2005), we also ran several cases designed to show the effect of retaining only the most recent references. Six test cases using different thresholding criteria were chosen, and each was tested using the 2008 file-year of Scopus data. These six test cases were the following:

1. CCA-01: include the top 1% highly cited references, regardless of age.

2. CCA-01R: include the top 1% highly cited recent references, from 2002–2008 only.

3. CCA-05: include the top 5% highly cited references, regardless of age.

4. CCA-STS: our published thresholding criteria (Klavans & Boyack, 2011); include references cited five or more times in a single year; include references published within 3 years with at least age + 1 citations; include references of age 0 cited at least twice. These criteria result in retaining approximately the top 12% of highly cited references.

5. CCA-R: include all recent references cited at least twice, from 2002 to 2008 only.

6. CCA-H: include the most recent one half of the references for each citing paper.

All six test cases excluded references that were cited more than 100 times in the 2008 file-year (about 4,000 references). These references are excluded because they lead to overaggregation within a cluster solution. We did not apply field normalization for these tests because our current criterion (CCA-STS) already has a relatively low citing threshold. This set of six cases tests the effect of threshold percentages (CCA-01, CCA-05, and CCA-STS) and also tests the effect of recency of references (cases CCA-01R, CCA-R, and CCA-H). We ran these six sets of references through our cocitation analysis process (Klavans & Boyack, 2011), which includes clustering of references, followed by fractional assignment of current papers to the clusters of reference papers using reference lists.
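The CCA-STS criteria above can be expressed as a small predicate. This is a minimal sketch under our reading of the criteria; the function name and the age convention (age = file year minus the reference's publication year) are our own assumptions, not from the paper:

```python
def include_reference_sts(citations_in_file_year: int, age: int) -> bool:
    """Return True if a reference passes the CCA-STS thresholding criteria.

    citations_in_file_year: times the reference is cited in the file-year.
    age: file year minus the reference's publication year (assumption).
    """
    if citations_in_file_year > 100:
        return False  # very highly cited references are excluded (overaggregation)
    if citations_in_file_year >= 5:
        return True   # cited five or more times in a single year
    if age == 0:
        return citations_in_file_year >= 2  # age-0 references cited at least twice
    if age <= 3:
        return citations_in_file_year >= age + 1  # recent: at least age + 1 citations
    return False
```

For example, an older reference needs five citations in the file-year to be retained, whereas a 2-year-old reference is retained with only three.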

The accuracy of each cluster solution was measured by calculating textual coherence using our established method (Boyack & Klavans, 2010). The premise behind this measure is that a cluster solution whose clusters have more coherent (or less divergent) word distributions is more accurate than a cluster solution with lower coherence. Said differently, this measure assumes that clusters with less textual variation are "better" clusters than those with more textual variation. Coherence is based on the Jensen-Shannon divergence (JSD; Lin, 1991), which quantifies the divergence between two probability distributions, in this case between the word distribution for a document and the word distribution for the cluster in which the document resides. Calculation of coherence for a full-cluster solution requires several steps. First, JSD is calculated for each document and then averaged over all documents in a cluster. Second, because JSD is cluster-size dependent, a coherence gain for the cluster is calculated by subtracting the JSD value for a random cluster of the same size. Finally, the coherence for the entire solution is calculated as the cluster-size-weighted average of the cluster coherence gains.
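The core of this computation can be sketched as follows. This is our own simplified illustration, not the authors' code: it computes per-document JSD against the cluster's pooled word distribution and averages it, but omits the random-cluster baseline and the cluster-size weighting described above:

```python
import math
from collections import Counter

def jsd(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence (base 2) between two word-count distributions."""
    total_p, total_q = sum(p.values()), sum(q.values())
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / total_p, q[w] / total_q
        mw = (pw + qw) / 2  # mixture distribution
        if pw:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw:
            div += 0.5 * qw * math.log2(qw / mw)
    return div

def mean_cluster_jsd(doc_words: list[Counter]) -> float:
    """Average JSD between each document and its cluster's pooled word distribution."""
    cluster = Counter()
    for d in doc_words:
        cluster.update(d)
    return sum(jsd(d, cluster) for d in doc_words) / len(doc_words)
```

A cluster of textually identical documents yields a mean JSD of 0; the more the documents' vocabularies diverge from the cluster's, the larger the value.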

Several quantities are shown in Table 1, which gives the numerical results of this thresholding study. From the standpoint of usefulness in creating cocitation models of science, we are interested in the solution that produces the best mix of high accuracy (Coh-Pap) and high coverage (%Pap). Coh-Pap is the coherence of the cluster solution based on the cluster assignments of papers from 2008, and %Pap is the percentage of papers with references that are included in the solution.

Table 1 shows that coherence and coverage both increase with the fraction of references included in the solution (compare CCA-01, CCA-05, and CCA-STS). Both quantities are at their highest values when ~12% of references are included in the solution. The data also show that limiting the set to recent references has a deleterious effect in terms of coverage; some current papers are not included in the solution because they do not reference recent work. The effect of limiting to current references when using large numbers of references also seems to affect the coherence negatively; CCA-R and CCA-H both have lower coherence values than the CCA-STS set. We note that the coherence values for the ~5% threshold and the ~12% threshold are within 1% of each other, a very small difference. Thus, it is very possible that the peak coherence lies somewhere between the two. However, it is also clear that increasing the number of references included in the calculation also increases coverage of current papers, and coverage is important when using the model to answer detailed questions related to planning. The data show that the CCA-STS thresholding criterion (our current baseline) provides the best solution among the criteria tested in that it has both the highest coverage and the highest coherence solution.

Incorporating Bibliographic Coupling in the Assignment of Current Papers

The next area of investigation stemmed from a lingering question that was raised in a previous article that compared the accuracy of the cluster solutions calculated using different citation-based techniques (Boyack & Klavans, 2010). One of the unexpected outcomes of that study was that the document clusters created by cocitation analysis had slightly lower coherence than those created by bibliographic coupling. Therefore, we had to determine whether this justified a complete shift to creating yearly models using bibliographic coupling. It is important to realize that clustering of current papers using bibliographic coupling is performed in one step, whereas clustering of current papers using cocitation analysis is performed in the following two steps: (a) clustering of reference papers into cocitation clusters and (b) assigning of the current papers to those cocitation clusters. The lower coherence of the cocitation solution could have been a consequence of either step.

We focused on whether the reason for lower coherence was the second step, the assignment of current papers to cocitation clusters. In the past, we have performed this step using simple fractional assignment based on the distribution of references to clusters for each paper. For example, for a paper with 10 references, if seven of those references appeared in one cluster and three in a second cluster, the paper would be assigned to those two clusters with fractions of 0.7 and 0.3, respectively.[1]

To the best of our knowledge, all researchers who have assigned current papers to cocitation clusters have used either dominant assignment (assigning the current paper to the single cluster it references most) or some form of simple fractional assignment. We assume this because no mention is made of other assignment techniques. Because different assignment techniques have not been published, we decided to design and test an alternate technique in hopes

[1] Additional detail is needed to replicate the simple fractional assignment process. To avoid assigning papers to very large numbers of clusters with very small fractions each, if a paper has at least two references in any one cluster, the tail of the distribution is truncated by ignoring clusters in which the paper only has one reference. We do this because we assume single counts to be potentially random events, whereas signals based on two or more counts are considered to be nonrandom events.
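The simple fractional assignment process, including the truncation rule from the footnote, can be sketched as follows (an illustrative reimplementation of the described procedure, not the authors' code):

```python
from collections import Counter

def fractional_assignment(ref_clusters: list[int]) -> dict[int, float]:
    """Fractionally assign a current paper to cocitation clusters.

    ref_clusters holds the cluster id of each of the paper's clustered
    references. Per the footnote: if any cluster receives two or more
    references, clusters with only a single reference are dropped as
    potentially random signals before fractions are computed.
    """
    counts = Counter(ref_clusters)
    if any(v >= 2 for v in counts.values()):
        counts = Counter({k: v for k, v in counts.items() if v >= 2})
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}
```

For the example in the text, `fractional_assignment([1]*7 + [2]*3)` yields fractions of 0.7 and 0.3 for the two clusters.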

TABLE 1. Results of the cocitation analysis thresholding study on six test cases.

            CCA-01     CCA-01R    CCA-05     CCA-STS    CCA-R      CCA-H
#Clust      20,421     8,230      57,279     97,276     103,090    216,511
#Ref        208,353    71,217     878,499    2,434,899  2,909,539  9,669,673
%Ref        1.02       0.35       4.32       11.96      14.29      47.50
Ref/Clust   10.20      8.65       15.34      25.03      28.22      44.66
#Pap        1,190,189  683,457    1,383,053  1,479,574  1,418,618  1,575,590
%Pap        73.65      42.29      85.59      91.56      87.79      97.50
Pap/Clust   58.28      83.04      24.15      15.21      13.76      7.28
Coh-Pap     0.07114    0.07321    0.08097    0.08190    0.07846    0.06976



that it would increase the accuracy (as measured by textual coherence) of the cluster solution. Simple fractional assignment is essentially fractional assignment based on direct citation. Given our previous results showing that bibliographic coupling produced a more accurate clustering than direct citation (Boyack & Klavans, 2010), we decided to create a current-paper-assignment approach that uses bibliographic coupling. A brief version of the approach (details are in Appendix B) is as follows.

• First, current papers are assigned using the simple fractional assignment (FA) process.

• Second, a bibliographic coupling (BC) solution is created using the method from Boyack and Klavans (2010). This gives two distinct solutions of current papers to clusters.

• Third, the two cluster solutions are compared at the paper level. The resulting FA:BC cluster pair sums over papers are used to create new fractional assignments.

• Fourth, for papers that appeared in only one of the two solutions, the assignment from that single solution is used.
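Under one possible reading of these steps, the combination can be sketched as below. The authoritative details are in the paper's Appendix B, which we do not reproduce; the data layouts, function name, and the exact use of the pair sums here are our assumptions:

```python
from collections import defaultdict

def cc_bc_assignment(fa: dict, bc: dict) -> dict:
    """Combine a fractional assignment (FA) solution with a bibliographic
    coupling (BC) solution (a sketch; see Appendix B of the paper).

    fa: {paper: {cocitation_cluster: fraction}}
    bc: {paper: bc_cluster}

    FA:BC cluster pair weights are summed over all papers; each paper in
    both solutions is then reassigned from its BC cluster's pooled weights.
    Papers in only one solution keep that solution's assignment.
    """
    # Sum FA fraction mass over (BC cluster, cocitation cluster) pairs.
    pair_sums = defaultdict(float)
    for paper, bc_cluster in bc.items():
        for cc_cluster, frac in fa.get(paper, {}).items():
            pair_sums[(bc_cluster, cc_cluster)] += frac

    result = {}
    for paper in set(fa) | set(bc):
        if paper not in bc:                      # FA-only paper: keep FA
            result[paper] = dict(fa[paper])
            continue
        weights = {cc: w for (b, cc), w in pair_sums.items() if b == bc[paper]}
        if not weights:                          # BC-only paper: keep BC
            result[paper] = {bc[paper]: 1.0}
            continue
        total = sum(weights.values())
        result[paper] = {cc: w / total for cc, w in weights.items()}
    return result
```

The effect is that papers whose BC neighbors reference a cocitation cluster heavily are pulled toward that cluster, even if their own reference counts alone would not place them there.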

We tested this new bibliographic-coupling-based assignment approach against the simpler fractional assignment approach using a set of 2.15 million documents (2004–2008) that intersect the Scopus and Medline databases. We used this set, rather than the Scopus 2008 set that was used for the cocitation threshold study, because we had previously calculated both cocitation and bibliographic coupling solutions for these data (Boyack & Klavans, 2010), and they were thus available for use without needing additional work.

To maintain the same basis of comparison that was used in our previous studies (Boyack et al., 2011; Boyack & Klavans, 2010), the accuracies of the two solutions were compared using two different metrics, textual coherence (as detailed above) and a concentration index based on NIH grant-to-article linkages from Medline. The premise for using grant-to-article linkages as a metric for measuring the accuracy of a cluster solution is the assumption that the articles acknowledging a single grant (especially small grants, e.g., R01 grants) should be highly related and should be concentrated in a cluster solution of the document space. With this approach, a cluster solution that does a better job of concentrating articles associated with individual grants is more accurate than one that does not concentrate these articles as well. One positive benefit of using grant-to-article linkages is that they are an extrinsic measure of quality; they are not used in the clustering in any way and thus do not bias the results.

The concentration measurement is limited to those grants that can show a concentrated solution. We limited the grant-to-article linkage set to those grants that have produced a minimum of four articles. The resulting test set consisted of 571,405 separate links between 262,959 unique articles (over 12% of the corpus) and 43,442 NIH grants. We use a standard Herfindahl index as a measure of concentration. This is calculated for each grant, i, as H_i = Σ_j (n_{i,j}/n_i)², where n_{i,j} is the number of articles acknowledging grant i in cluster j, and n_i is the total number of articles acknowledging grant i. An overall value for each cluster solution is then calculated as the weighted average over all grants, H = Σ_i n_i H_i / Σ_i n_i.
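These two formulas transcribe directly into a few lines of code (the function names are our own):

```python
def grant_herfindahl(cluster_counts: list[int]) -> float:
    """H_i = sum over clusters j of (n_ij / n_i)^2 for one grant,
    where cluster_counts holds the grant's article counts per cluster."""
    n_i = sum(cluster_counts)
    return sum((n_ij / n_i) ** 2 for n_ij in cluster_counts)

def overall_herfindahl(grants: list[list[int]]) -> float:
    """H = sum_i n_i * H_i / sum_i n_i (article-count-weighted average)."""
    totals = [sum(g) for g in grants]
    return sum(n * grant_herfindahl(g) for n, g in zip(totals, grants)) / sum(totals)
```

A grant whose four articles all land in one cluster scores 1.0; split evenly over two clusters it scores 0.5, so higher values indicate a more concentrated (and, by the premise above, more accurate) cluster solution.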

Table 2 shows that the coherence and concentration metrics are both significantly higher for our new bibliographic-coupling-based assignment approach than for the simple fractional assignment (CC-FA) approach. We label this new approach as CC-BC because it combines standard cocitation clustering of references with a bibliographic-coupling-based assignment of current papers. The coherence for the CC-BC approach is also higher than that for a standard bibliographic coupling (BC) approach. BC alone has a higher concentration index than the CC-BC approach, but not by much. Given the size (more than 2 million articles) and scope of the data set, we take these results as an indicator that a combined cocitation/bibliographic coupling approach is preferable to using either approach separately and have revised our approach to modeling accordingly.

Sequentially Hybrid Map Layout

Once a model (i.e., in our case, a cluster solution) has been created from a set of documents, it is often useful to create a visual map of the model. One straightforward way of doing this is to create a graph layout of the clusters in the model. Several methods have been devised for this type of visualization. However, the most common layout algorithms in use today (e.g., Fruchterman-Reingold, Kamada-Kawai) are typically only used to generate layouts for small data sets (hundreds of objects). We use the OpenOrd (formerly DrL) algorithm to generate layouts for sets of hundreds of thousands of objects (Martin, Brown, Klavans, & Boyack, 2011).

For many years, we have generated visual maps of our large-scale cocitation models using cocitation between pairs of clusters. This choice was based on tradition; for example, Small (1999) used sequential cocitation analysis steps to create a hierarchical set of document clusters. To do this, one takes the original list of citing–cited article pairs, replaces the cited articles with their cluster numbers, and then runs the same algorithm used in the original cocitation similarity calculation but with cluster numbers as the cited items. This results in a set of cluster–cluster similarity values based on cocitation at the cluster level. This set of similarity values can then be used as input to a graph layout algorithm to calculate the cluster coordinates that, when plotted, form

TABLE 2. Comparison of the accuracies of two different methods for assigning current papers to cocitation clusters.*

           BC                 CC-FA      CC-BC

Coh-Pap    0.08599 (+5.3%)    0.08167    0.08865 (+8.5%)
Herf       0.28486 (+19.8%)   0.23778    0.27516 (+15.7%)

*Both current paper assignment methods are compared with simple bibliographic coupling (BC). Percentages show the gain above the fractional assignment solution.

674 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—April 2014. DOI: 10.1002/asi


a visual map of the model. As an example, Figure 1 (left) shows the map of 116,163 article clusters (cocitation clusters) from the 2010 file-year of Scopus data created using this approach. Although this map shows the relative positions of major areas of science, we have never found this type of map to be as appealing and informative as we would like, because everything is so bunched together. There is very little white space in this visual map; there are few groupings of clusters that differentiate themselves from the overall mass and that could represent discipline-level structures.
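The cluster-level cocitation computation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function and variable names are ours, and it returns raw cluster–cluster co-occurrence counts, whereas the production pipeline would additionally normalize them with the same similarity measure used at the article level.

```python
from collections import defaultdict
from itertools import combinations

def cluster_cocitation(citing_cited_pairs, cluster_of):
    """Roll article-level cocitation up to the cluster level.

    citing_cited_pairs: iterable of (citing_id, cited_id) tuples.
    cluster_of: dict mapping cited article id -> cluster number.
    Returns raw cluster-cluster cocitation counts (unnormalized).
    """
    # Replace each cited article with its cluster number.
    refs_by_citer = defaultdict(set)
    for citing, cited in citing_cited_pairs:
        if cited in cluster_of:
            refs_by_citer[citing].add(cluster_of[cited])

    # Count how often each pair of clusters is cited by the same paper.
    counts = defaultdict(int)
    for clusters in refs_by_citer.values():
        for a, b in combinations(sorted(clusters), 2):
            counts[(a, b)] += 1
    return dict(counts)

pairs = [("p1", "r1"), ("p1", "r2"), ("p1", "r3"),
         ("p2", "r1"), ("p2", "r3")]
clusters = {"r1": 1, "r2": 1, "r3": 2}
sims = cluster_cocitation(pairs, clusters)
# clusters 1 and 2 are co-cited by both p1 and p2
```

The resulting counts (or similarities derived from them) are then fed to the layout algorithm in place of the article-level values.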

Prior research has shown us that different similarity types give rise to different types of visual cluster patterns. Typical patterns range from the space-filled mass exemplified in Figure 1 (left), to concentrated areas tied together with stringy pathways, to very tight clusters with no pathways between them (Boyack, Klavans, & Börner, 2005). White space increases along this continuum. Aesthetically, we favor the middle ground: cluster solutions that show sets of islands connected by pathways. These types of visual structures imply that there are multiple high-level areas in science that are differentiated from each other and that these areas are connected.

A test was run to determine whether a cluster–cluster similarity based on textual analysis might create a different visual cluster pattern, one that was both more appealing and more accurate. Text-based similarity measures require much more computation than citation-based measures because the cluster–cluster matrix is typically much less sparse for text than for citations. Calculating a cluster–cluster similarity matrix based on textual characteristics for 100,000 clusters or more is thus a significant undertaking. Many different text-based measures are available for such a calculation. We chose to use the BM25 measure (Spärck Jones, Walker, & Robertson, 2000a, 2000b) because it is simple to calculate, is among the least computationally expensive text approaches, and is among the most accurate text measures (Boyack et al., 2011; Lin & Wilbur, 2007). Each cluster was represented textually as comprising the titles and abstracts of its papers. BM25 was then used to calculate cluster–cluster scores. The BM25 similarity between one object, q, and another object, d, is calculated as

s(q, d) = \sum_{i=1}^{n} \mathrm{IDF}_i \, \frac{n_i (k_1 + 1)}{n_i + k_1 \left( 1 - b + b \frac{|D|}{\bar{D}} \right)}

where n_i is the frequency of term i in object d. Note that n_i = 0 for terms that are in q but not in d. Typical values were chosen for the constants k_1 and b (2.0 and 0.75, respectively). In our formulation, each cluster was treated as if it were a single document. Document length |D| was estimated by adding the term frequencies n_i per document. Average document length \bar{D} is computed over the entire set of documents. The IDF value for a particular term, i, is computed as

\mathrm{IDF}_i = \log \frac{N - d_i + 0.5}{d_i + 0.5}

where N is the total number of documents in the data set and d_i is the number of documents containing term i. Each individual term in the summation in the first formula is independent of document q. To remove the influence of high-frequency terms, all terms with IDF scores below 2.0 were discarded.

FIG. 1. Visual maps of cocitation clusters from a Scopus 2010 model, where the cluster layout was based on cocitation similarities (left) and BM25 text similarities (right) between clusters. Map unit scales are shown in actual units (left) and log units (right). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
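As a concrete illustration of the BM25 formulation above, the following Python sketch computes cluster–cluster scores over toy term-frequency vectors. The constants (k_1 = 2.0, b = 0.75) and the IDF < 2.0 cutoff come from the text; everything else (function names, variable names, the toy data) is our own and should not be taken as the authors' pipeline.

```python
import math
from collections import Counter

K1, B, IDF_MIN = 2.0, 0.75, 2.0  # constants stated in the text

def bm25_scores(docs):
    """Build a BM25 scoring function over term-frequency vectors.

    docs: dict mapping name -> Counter of term frequencies; each
    cluster is treated as a single 'document', as in the text."""
    N = len(docs)
    df = Counter()                       # document frequency per term
    for tf in docs.values():
        df.update(tf.keys())
    idf = {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}
    avg_len = sum(sum(tf.values()) for tf in docs.values()) / N

    def score(q, d):
        tf_d, len_d = docs[d], sum(docs[d].values())
        s = 0.0
        for term in docs[q]:
            if idf.get(term, 0.0) < IDF_MIN:   # drop high-frequency terms
                continue
            n_i = tf_d.get(term, 0)            # 0 if term absent from d
            s += idf[term] * n_i * (K1 + 1) / (
                n_i + K1 * (1 - B + B * len_d / avg_len))
        return s
    return score

# Toy data: 30 'clusters'; two of them share the rare term "quark".
docs = {f"c{i}": Counter({"the": 1}) for i in range(28)}
docs["a"] = Counter({"quark": 2, "the": 1})
docs["b"] = Counter({"quark": 1, "the": 1})
score = bm25_scores(docs)
# clusters sharing the rare term get a positive score; pairs sharing
# only the ubiquitous term "the" (IDF below the cutoff) score zero
```

In the actual calculation, each "document" is the concatenated titles and abstracts of a cluster's papers, and scores are computed for pairs of clusters.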

Figure 1 (right) shows the visual map of cocitation clusters that resulted from using BM25 to calculate cluster–cluster similarities. Not all similarity values were used in the calculation. Rather, the similarity file was filtered to the top-N similarities per cluster, and layout was done with OpenOrd using the default edge-cutting parameter. The number of similarities N per cluster varied between 5 and 15 and was scaled to sum(BM25) using the top 50 similarities per cluster. These same steps were also used for the cocitation map at left in Figure 1, allowing a comparison of the two maps that can be definitively tied to the differences in similarity measures.

The similarities and differences between these two maps are quite striking. Once the layouts are given the same orientation (Physics at the top and the Social Sciences at left), the relative location of each field is almost exactly the same. Biology is in the middle, Chemistry is at right, and Medicine is at the bottom. Earth Sciences is above Biology and borders Computer Science and Engineering. Brain Research borders the Social Sciences, Health Sciences, and Medicine in both maps. The ordering and relative locations of large fields are not affected by the use of a cocitation vs. text-based cluster–cluster similarity measure. This is to be expected given the consensus in high-level map structures that has recently been noted by multiple researchers (Klavans & Boyack, 2009; Leydesdorff & Rafols, 2009).

The major difference between these two maps is the existence of interior white space in the text-based map at right. Interior white space is an extremely attractive characteristic in that it may represent the between-cluster gaps that are at the heart of a network-based theory of innovation (Chen et al., 2009). The BM25 similarity values are an order of magnitude higher than the cocitation similarity values, as shown in Figure 2. This suggests that the lower similarities lead to much more even spacing between nodes in the map, whereas higher similarity values create a map with much more well-defined groupings of clusters. This should also not be surprising. The first level of clustering in both maps was based on cocitation, using up a high fraction of the variance in the system that could be accounted for using cocitation. A second level of similarity using cocitation should thus have far less signal available with which to link clusters than would textual analysis, the majority of whose signal would still be available.

We find the text-based map to be far more visually compelling than the cocitation map because of the localized density of clusters, the greater amount of white space, and the visible pathways between localized areas that indicate connections between discipline-like structures. This new mapping technique can be considered a hybrid technique. Although the first-level similarity metric is not a citation + text hybrid, this technique uses a citation-based method to generate clusters, followed by a text-based method to generate a cluster layout, and is thus a sequentially hybrid map-layout technique.

The arguments above speak to the aesthetic nature of the sequentially hybrid map and give anecdotal arguments for why this approach may capture more of the variance in the system than using two sequential cocitation steps. In addition to anecdotal evidence, we present some numerical evidence that the sequentially hybrid layout is more accurate than the map based purely on cocitation analysis. For each map in Figure 1, all possible pairs of articles with a common author were located. For instance, if a researcher authored four articles (A, B, C, D), the set of distinct pairs of those papers is AB, AC, AD, BC, BD, CD. In total, 12,812,133 distinct (and deduplicated) pairs of papers linked through common authors were identified, and the distances between each pair in map units were calculated. Map units are shown in Figure 1. Each map was normalized to 1,000 units in both the x and y directions to allow direct comparison between the two maps. Figure 3 shows that a significant fraction of the distances in the sequentially hybrid (BM25) layout are shorter than those in the citation-based layout. The distribution for the text-based layout is bimodal, with the first peak at 1.5 log units and the second peak at 5.5 log units, whereas the citation-based layout has only a single peak at 5.3 log units, with a tail at the leading end. To get a sense of

FIG. 2. Similarity value distributions for the two map layout methods. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


what the log units mean in practice, Figure 1 (right) shows log unit scales; 1.5 log units has a length on the map of about one fifth the length of the 3.2-log-unit bar. The sequentially hybrid layout, on average, clearly places papers by the same author closer to each other within the map than does the citation-based layout. Given that multiple papers by an author should be closer together on a map rather than farther apart, this suggests that the sequentially hybrid map is more accurate, as well as more appealing, than the citation-based map.
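The author-pair distance analysis described above can be sketched as follows. This is a hypothetical reconstruction (the function name, input structures, and normalization details are ours): each map is normalized to 1,000 units per axis, and paper pairs shared by coauthors are deduplicated before distances are computed.

```python
from itertools import combinations
from math import hypot

def author_pair_distances(papers_by_author, coords, scale=1000.0):
    """Distances between same-author paper pairs on a map layout.

    papers_by_author: dict author -> list of paper ids.
    coords: dict paper id -> (x, y) in raw map units.
    Coordinates are normalized so each axis spans `scale` units,
    mirroring the 1,000-unit normalization described in the text."""
    xs = [x for x, _ in coords.values()]
    ys = [y for _, y in coords.values()]

    def norm(v, lo, hi):
        return scale * (v - lo) / (hi - lo) if hi > lo else 0.0

    normed = {p: (norm(x, min(xs), max(xs)), norm(y, min(ys), max(ys)))
              for p, (x, y) in coords.items()}

    seen, dists = set(), []
    for papers in papers_by_author.values():
        for a, b in combinations(sorted(set(papers)), 2):
            if (a, b) in seen:        # deduplicate pairs shared by coauthors
                continue
            seen.add((a, b))
            (x1, y1), (x2, y2) = normed[a], normed[b]
            dists.append(hypot(x2 - x1, y2 - y1))
    return dists

papers_by_author = {"au1": ["A", "B"], "au2": ["A", "B", "C"]}
coords = {"A": (0, 0), "B": (10, 0), "C": (0, 10)}
dists = author_pair_distances(papers_by_author, coords)
# three deduplicated pairs remain: AB, AC, BC (the shared AB is counted once)
```

The resulting distance list can then be log-binned to produce distributions like those in Figure 3.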

A Dynamic Global Model and Map of Science

Model Creation and Characterization

The improvements to science mapping described in the previous sections can be used to generate more accurate models and maps of science. These improvements have been incorporated into our processes and have been used to create a new model and map of science using a 16-year (1996–2011) set of Scopus data comprising nearly 20 million documents. Although the entire Scopus data set from those years contains 25.6 million records, only 20.6 million of those have references, and these comprise our basis set.

The model and map were created in several steps, which are briefly mentioned here. Detailed protocols can be found in Appendices A and B.

• Annual models are created for each year. Each annual model is calculated by first generating clusters of cited references (cocitation clusters) and then assigning the current-year papers to the cocitation clusters using the bibliographic-coupling approach introduced earlier in this paper.

• Annual models are then linked into longitudinal structures (called “threads”) by linking clusters of documents from adjacent years together using overlaps in the cited references belonging to each cluster.

Sixteen annual models were created using this process for the 16-year period from 1996 to 2011. Table 3, which contains numbers of papers and clusters by year, shows that the process is relatively stable in terms of cluster sizes and the fraction of articles covered by each annual model. Roughly 4% of the articles containing references are missing from the models. The majority of these articles are missing because they did not contain any references that were included in the cocitation calculations.

The 16 annual models of Table 3 were then linked together into a longitudinal model of science. Pairs of clusters from adjacent years whose cosine (based on overlap of their cited references) was greater than the FwdCos value listed in Table 3 were linked. Examples of linked sets of clusters, that is, threads, are shown in Figure 4.
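The thread-linking step can be sketched as below. The exact overlap measure is defined in Appendix B, which is not reproduced here, so this sketch assumes a set cosine (Salton's cosine over the clusters' cited-reference sets); the data-structure and function names are ours.

```python
from math import sqrt

def link_threads(clusters_by_year, fwd_cos):
    """Link clusters in adjacent years whose reference-overlap cosine
    exceeds a year-specific threshold (cf. the FwdCos values of Table 3).

    clusters_by_year: dict year -> {cluster_id: set of cited reference ids}.
    fwd_cos: dict year -> threshold for linking year to year+1.
    Returns a list of (year, cluster_a, year+1, cluster_b, cosine)."""
    links = []
    for year in sorted(clusters_by_year):
        nxt = year + 1
        if nxt not in clusters_by_year or year not in fwd_cos:
            continue
        for ca, refs_a in clusters_by_year[year].items():
            for cb, refs_b in clusters_by_year[nxt].items():
                if not refs_a or not refs_b:
                    continue
                # set cosine: |A ∩ B| / sqrt(|A| * |B|)
                cos = len(refs_a & refs_b) / sqrt(len(refs_a) * len(refs_b))
                if cos > fwd_cos[year]:
                    links.append((year, ca, nxt, cb, cos))
    return links

clusters_by_year = {2010: {"a": {1, 2, 3, 4}},
                    2011: {"b": {3, 4, 5, 6}, "c": {7}}}
links = link_threads(clusters_by_year, {2010: 0.25})
# cluster "a" links forward to "b" (cosine 0.5 > 0.25); "c" shares nothing
```

A production implementation would restrict the candidate pairs (e.g., via an inverted index on references) rather than comparing all cluster pairs.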

FIG. 3. Distribution of distances between papers authored by the same researcher for the two map layout methods. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

TABLE 3. Annual details of the CC-BC model of science.

Year #Clust #Pap Pap/Clust %Pap #Ref Ref/Clust FwdCos

1996   54,221     752,442     13.88   95.2   1,072,014   19.77   0.2593
1997   56,225     774,390     13.77   95.3   1,108,296   19.71   0.2578
1998   57,434     788,643     13.73   95.3   1,149,310   20.01   0.2572
1999   59,048     808,027     13.68   95.3   1,215,370   20.58   0.2605
2000   64,072     876,335     13.68   95.4   1,333,079   20.81   0.2575
2001   70,680     965,106     13.65   95.7   1,447,172   20.47   0.2540
2002   74,207     1,004,837   13.54   96.0   1,541,707   20.78   0.2547
2003   79,657     1,080,103   13.56   96.2   1,665,590   20.91   0.2535
2004   90,074     1,212,349   13.46   96.1   1,854,537   20.59   0.2500
2005   98,848     1,332,524   13.48   96.2   2,058,536   20.83   0.2451
2006   107,197    1,451,006   13.54   96.1   2,243,455   20.93   0.2385
2007   113,426    1,546,811   13.64   95.6   2,360,593   20.81   0.2357
2008   121,595    1,645,524   13.53   95.6   2,539,626   20.89   0.2294
2009   130,701    1,754,603   13.42   96.0   2,759,731   21.11   0.2230
2010   135,836    1,807,757   13.31   96.4   2,930,351   21.57   0.2227
2011   151,305    2,004,176   13.25   96.6   3,277,735   21.66
Total  1,464,526  19,804,633  13.52   95.9

Note. Clust, clusters; Pap, current papers; Ref, cited references; FwdCos, cosine threshold for linking.


Table 4 shows the age characteristics of the resulting threads. Roughly 40% of the annual clusters are what we call “isolates.” These are clusters that link neither forward nor backward within the overall model (above a specific threshold; the FwdCos column in Table 3) and can thus be thought of as 1-year threads. These are research problems that do not have enough momentum to continue into a second year. We note that the exact fraction of isolates is a function of the linking thresholds used in the model (see Appendix B). However, this fraction does not decrease dramatically with a lowering of the threshold values. Isolates are typically among the smallest clusters, whereas the longest threads comprise larger clusters on average. Forty-six percent of annual clusters are in threads that last for 3 years or longer.

This thread model reflects a combination of stable and unstable science. More than 28% of its papers are in threads that are aged 8 years or older. This forms a very stable core of science at the level of research problems. Only 31% of papers are in isolates; thus, less than one third of the articles published each year are experiments that are not followed up cognitively in the next year. We realize that many researchers may feel that science is more stable than suggested by this model. It is true that science is quite stable when measured at the level of disciplines. However, it is also true that many scientists work on a variety of research problems and move from problem to problem in search of problems that have staying power. It is at this level that science is unstable.

Table 4 also shows the importance of topic momentum. The “Pcont” column shows the probability that a thread will continue to the next year as a function of thread age. Only 26.5% of threads that are newly born (first year) will continue to a second year. Once a thread is 2 years old, it has a 60.6% chance of continuing to a third year. Survival rates increase with age. We note that the survival rate decreases slightly for several years after the eighth year. This is an artefact of our threading criterion (see Appendix A) of not allowing two long threads to be retrospectively linked together.
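A Pcont-style survival probability can be estimated from thread lifetimes as sketched below. This is a naive estimate of our own construction: the published values also depend on the linking thresholds and on the rule against retrospectively merging long threads, so this calculation will not reproduce Table 4 exactly.

```python
from collections import Counter

def continuation_probabilities(thread_ages):
    """Estimate Pcont(a): the probability that a thread alive at age a
    continues to age a+1, from a list of final thread ages (in years)."""
    by_age = Counter(thread_ages)
    max_age = max(by_age)
    # alive[a] = number of threads that reached age a
    alive = {a: sum(n for age, n in by_age.items() if age >= a)
             for a in range(1, max_age + 1)}
    return {a: alive.get(a + 1, 0) / alive[a]
            for a in range(1, max_age) if alive[a]}

# Toy data: six threads with final ages 1, 1, 1, 2, 2, 3 years.
p = continuation_probabilities([1, 1, 1, 2, 2, 3])
# half of the threads survive past year 1; a third of the
# survivors continue past year 2
```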

Visual Map

A visual map (Figure 5) has been created from the model using the text-based layout method described above. BM25 coefficients were calculated between pairs of threads using their titles and abstracts. Isolates were not included in this calculation; only the 190,151 threads lasting for 2 years or longer were included in the layout. Each article was assigned a color based on the color-to-journal scheme used by the University of California at San Diego (UCSD) map of science (Börner, Klavans, et al., 2012), and each thread was colored by its dominant article color. Isolates were added into the map later by positioning each isolate next to the thread with which it shared the largest set of fractional paper assignments.

This map is very similar in terms of high-level structure to the consensus view of science (Klavans & Boyack, 2009). The so-called hard sciences, engineering, and computer science all appear in the top half of the map, whereas the medical and related sciences appear in the lower half of the map. Chemistry (blue) is to the right, and the Social Sciences (light orange) are to the left. The map can easily be time sliced to show growth in the various areas of science over time. Although this is not dynamic in the sense that it does not allow clusters to move in the map from year to year, it does show births and deaths of clusters. This map is shown here at a high level, but smaller sections can be enlarged to show greater detail. It can also be used as a template on which additional information, such as cluster ages or output from a particular author, journal, or institution, can be overlaid (Rafols et al., 2010).

FIG. 4. Examples of threads. a: Three year. b: Four year. c: Five year with a split. d: Four year with a merge.

TABLE 4. Characteristics of threads in the linked cocitation/bibliographic coupling model.

Age (years) #Thr #Clust Clust/Thr/year Pap/Clust Pcont

1      596,807    596,807  1.000  10.31  0.265
2      92,191     189,268  1.026  12.08  0.606
3      35,309     113,688  1.073  13.44  0.748
4      17,569     79,481   1.131  14.70  0.829
5      11,687     69,842   1.195  15.66  0.859
6      7,914      59,895   1.261  16.50  0.885
7      5,969      56,049   1.341  17.00  0.896
8      4,406      50,106   1.422  17.54  0.911
9      3,812      49,078   1.431  18.26  0.894
10     3,024      43,337   1.433  18.55  0.889
11     2,562      41,209   1.462  18.86  0.878
12     1,791      31,008   1.443  18.86  0.883
13     1,201      22,785   1.459  19.16  0.894
14     738        14,995   1.451  19.27  0.911
15     538        11,734   1.454  19.24  0.916
16     1,440      35,244   1.530  21.01
Total  1,464,526

Note. Thr, threads; Pcont, probability of continuing to the next year.


Definition of Discipline-Like Structures

To make the map more useful for answering high-level questions and to allow it to be compared with existing discipline-level maps, we have partitioned the map into discipline-like groups using a semiautomated process. To do this, we overlaid the map with an 85 × 85 grid structure and “grew” disciplines from grid cells that were chosen as discipline seeds. The principle behind this process is to take a grid cell (or a group of adjacent grid cells), link to it the grid cell with which it has the greatest overlap, and then repeat that process iteratively until all grid cells have been linked into a group. Square grid cells have no inherent overlap, so for the growth calculation we define the grid cells as being circular, such that each square grid cell fits entirely within its corresponding circular grid cell (see Figure 6). With circular grid cells, most threads and isolates belong to more than one cell, thus creating natural overlaps between grid cells that can be used as a linkage mechanism. For example, the center cell in Figure 6 has potentially substantial overlaps with the cells above, below, and to either side of it. In our process, if the center cell were chosen as the seed, the cell with the largest number of articles in the overlap space with the center cell would be linked to it. In the second iteration, the cell with the largest overlap with either the center cell or the first one connected to it would then be linked to the group.
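The growth procedure just described can be sketched as a greedy agglomeration over grid-cell overlaps. This simplified version of our own omits the same-color constraint applied during the first 20 iterations and the fixed 40-iteration budget; all names are hypothetical.

```python
def grow_groups(seeds, overlap):
    """Greedy growth of discipline-like groups from seed grid cells.

    seeds: iterable of seed cell ids (one group grows per seed).
    overlap: dict mapping frozenset({cell_a, cell_b}) -> number of
             articles shared by the two (circular) cells.
    In each pass, every group absorbs the unassigned cell having the
    largest overlap with any of the group's member cells."""
    groups = {s: {s} for s in seeds}
    assigned = set(seeds)
    cells = {c for pair in overlap for c in pair}
    grew = True
    while grew:
        grew = False
        for seed, members in groups.items():
            best, best_ov = None, 0
            for cell in cells - assigned:
                ov = max((overlap.get(frozenset({cell, m}), 0)
                          for m in members), default=0)
                if ov > best_ov:
                    best, best_ov = cell, ov
            if best is not None:
                members.add(best)
                assigned.add(best)
                grew = True
    return groups

overlaps = {frozenset({"s1", "a"}): 5,
            frozenset({"a", "b"}): 3,
            frozenset({"s2", "c"}): 4}
groups = grow_groups(["s1", "s2"], overlaps)
# "s1" absorbs "a" (overlap 5) and then "b" via "a"; "s2" absorbs "c"
```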

We started by identifying a set of grid cells as discipline seeds, as shown at top left in Figure 7. The first set of seeds was identified automatically by choosing the grid cell in each 5 × 5 subsection of the map with the largest number of articles. This starting set was modified by hand, deleting seeds where two seeds were too close to each other and adding seeds on “islands” in the map that did not have seeds. Disciplines were then grown from the seeds over the course of 40 iterations using the linkage process described above. A constraint was placed on the first 20 iterations that required linkages to be between grid cells of the same color. This ensured that the initial, large growth of disciplines was within the established high-level groupings of science. This

FIG. 5. Visual map of the CC-BC 16-year model of science. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


restriction was lifted for the final 20 iterations, thus allowing the disciplines to grow in an interdisciplinary way where appropriate.

Figure 7 shows the results of the growth process at three different points: after seed identification, after 10 iterations, and after all 40 iterations. After 40 iterations, no more grid cells were being linked to existing groupings, and the calculation was stopped.

With the calculation complete, threads and isolates were assigned to disciplinary groups. To avoid problems with multiple assignments, each thread and isolate was assigned to the single square grid cell in which it occurs. Circular grid cells were not used for this assignment process. The bottom right panel in Figure 7 shows the resulting 211 discipline-like groups that were created using this process. Approximately 250 grid cells with very low article counts, containing a mere 0.4% of the articles in the model, did not end up being linked into any disciplinary group because of lack of overlap. These will be assigned to existing disciplines as needed.

We realize that the disciplinary structure defined here is somewhat arbitrary, depending largely on the number of seeds that were used. However, it is less arbitrary than one might think. Close examination of the bottom left panel in Figure 7 shows that the density of articles in different grid cells varies widely. It is natural to assume that disciplinary structures will be based on dense areas within the map and separated by areas with much lower density. Comparison of the two bottom panels of Figure 7 shows that the majority of the disciplines are associated with areas of high article density and are separated by areas of lower density. This is consistent with the principle behind the creation of journal-based disciplines, which is that journals are grouped such that the within-group citation density is high and the between-group citation density is low, suggesting that the disciplinary groupings shown in Figure 7 are reasonable.

A detailed description of the resulting disciplinary structure is beyond the scope of this article. Nevertheless, we give examples of some disciplinary groups that exemplify different combinations of properties (Table 5). These disciplines were chosen because they provide examples of different types of extremes with respect to properties. For example, “Superconductivity” and “Elementary particle physics” are examples of disciplines with low growth rates, relatively high stability (a high percentage of long-duration threads), and monodisciplinarity (high correlation with journal-based disciplines). “Sleep”; “Environmental, energy and economic policy”; “Business ethics and strategy”; and “Literature and language” are examples of the exact opposite combination of properties: high growth rates, relatively low stability, and multidisciplinarity. In general, growth rates tend to be inversely correlated with stability. Some interesting exceptions to these trends are “Psychoanalysis,” which has low growth and low stability, and “Power transmission and distribution,” which combines high growth with relatively high stability while being very

FIG. 6. Sample of a grid structure showing square and round representations of the grid cells and overlaps between round grid cells. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

TABLE 5. Examples of disciplines identified from the CC-BC map of science.

Discipline name #Pap Growth Mono %Unstable %Stable

Superconductivity                           111,980   0.79   0.183   61.1   5.6
Psychoanalysis                              24,073    0.98   0.159   82.1   0.3
Endocrinology                               88,542    1.00   0.021   75.7   3.0
Elementary particle physics                 108,123   1.07   0.299   57.2   9.2
Sleep                                       79,948    1.57   0.020   79.4   1.3
Environment, energy, and economic policy    97,436    1.59   0.029   82.6   0.8
Nanotechnology                              83,449    1.60   0.038   73.7   1.3
Business ethics and strategy                182,689   1.64   0.034   82.6   1.2
Literature and language                     108,493   1.71   0.017   84.9   0.1
Power transmission and distribution         174,313   1.85   0.163   65.1   3.3

Note. Mono, level of monodisciplinarity; Growth, #Pap2011/#Pap2007; %Unstable, percentage of threads that are isolates; %Stable, percentage of threads lasting at least 8 years.


monodisciplinary. “Nanotechnology” exhibits high growth with midrange stability, whereas “Endocrinology” exhibits low growth with midrange stability.

The monodisciplinarity index of Table 5 was calculated for each of the disciplines in the map of Figure 6 by locating their articles in 554 journal-based disciplines from the UCSD map of science (Börner, Klavans, et al., 2012) and then calculating a Herfindahl (concentration) index on the resulting fractions. A high value indicates a high degree of overlap with a journal-based disciplinary structure, whereas a low value indicates the opposite. One hundred fifty-three of the two hundred eleven disciplines have a concentration value of less than 0.1 (the highest value is 0.56, for “Astronomy”), which shows that this article-based disciplinary structure is significantly different from a journal-based disciplinary structure across a wide range of science. Given that this model was built from the ground up based on interactions at the article level, and thus reflects the structure of science at a much more topical level than can be found in a journal-based model, these results once again call into

FIG. 7. Creating disciplines by dividing the map of science into a series of fine grids, specifying grid points as discipline seeds, and iteratively linking adjacent grid points to those already chosen. Grid dots reflect the number of papers in each grid cell. Final disciplines and relative sizes are shown at bottom right. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


question the validity of journal-based structures of science (Boyack & Klavans, 2011).
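The monodisciplinarity (Herfindahl) index used in Table 5 reduces to a few lines. This sketch assumes the input is a count of a discipline's articles per journal-based UCSD category; the function name is ours.

```python
def monodisciplinarity(article_counts):
    """Herfindahl concentration index over the fractions of a
    discipline's articles that fall into each journal-based discipline.

    article_counts: dict journal_discipline -> number of articles."""
    total = sum(article_counts.values())
    return sum((n / total) ** 2 for n in article_counts.values())

# A discipline concentrated in one journal-based category scores high;
# one spread evenly over k categories scores 1/k.
focused = monodisciplinarity({"phys": 90, "chem": 10})   # 0.82
spread = monodisciplinarity({"a": 25, "b": 25, "c": 25, "d": 25})  # 0.25
```

Under the text's convention, values below 0.1 mark disciplines that overlap little with any journal-based grouping.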

Implications and Future Directions

This article reports on a 16-year model of science containing nearly 20 million documents that was created using a combined cocitation and bibliographic-coupling process. The bibliographic-coupling-based process for assigning current papers to cocitation clusters is new and is a process advance that significantly increases the accuracy of the model. A sequentially hybrid approach to producing useful visual maps from models was introduced and was likewise shown to increase accuracy. The resulting cluster structure shows areas of high density, with pathways between those areas of high density. This structure is amenable to segmentation into discipline-like groupings, and a process was introduced to split the map into disciplines.

This highly detailed, dynamic model and map of science was created primarily for use in answering planning-related questions. Despite this, it can also be used for evaluation functions. For example, segmentation of the map into discipline-like groupings allows calculation of field-normalized metrics using an article-level, rather than journal-level, set of standards. Given the recent success of citation metrics based on citing-side normalization (Zitt, 2010) or fractional citation counting (Waltman et al., 2012), it is worth investigating whether field normalization based on an article-level classification system would be a suitable alternative to existing normalization methods.

We have a variety of efforts underway that use the map to answer detailed questions raised by research planners. For example, the threads from this map have been combined with a direct citation approach to successfully identify recent, emerging topics in science (Small, Boyack, & Klavans, 2013). We are investigating new types of metrics to identify innovative (rather than impactful) articles and researchers and are working with funders to correlate data associated with grant applications with metrics associated with applicants, referees, and review panels. Initial indications from these new investigations are all positive. These investigations all rely on the ability to model science at the highly detailed level of the model and map reported here and thus suggest that this model will prove useful in answering detailed policy-related questions.

We note that these efforts are retrospective rather than predictive. However, they are moving us toward prediction in the sense that we are developing a detailed understanding of how science operates and how those operations are embodied in a model. This understanding is one key to prediction; a better understanding of fundamental operations in science and technology and how they impact structure will, in time, allow us to project from current models into the future.

Improving science models is a continuous process. The model presented here can be improved in several ways. As it relates to planning applications, this model is not global enough, in that it does not include enough technology (patents) or other technical literature that is not indexed by the large citation databases. It includes very little non-English-language literature. For example, the Chinese scientific and technical databases have millions of additional documents that, if added, could substantially change our perception of the landscape of science and technology. In addition, these maps do not include funding sources, which often can be thought of as early indicators of where science is moving. Finally, as models increase in size and complexity, these increases must be accompanied by increasing numbers of validation studies.

Acknowledgments

We thank Michael Patek for extracting the necessary data and for creating and running the scripts that create our models of science. The sequentially hybrid layout work described here was supported by the Center for Scientific Review (CSR) at the U.S. National Institutes of Health.

References

Börner, K., Chen, C., & Boyack, K.W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37, 179–255.

Börner, K., Boyack, K.W., Milojević, S., & Morris, S. (2012). An introduction to modeling science: Basic model types, key definitions, and a general framework for the comparison of process models. In A. Scharnhorst, P. van den Besselaar, & K. Börner (Eds.), Understanding complex systems (pp. 3–22). Berlin: Springer-Verlag.

Börner, K., Klavans, R., Patek, M., Zoss, A.M., Biberstine, J.R., Light, R.P., et al. (2012). Design and update of a classification system: The UCSD map of science. PLoS ONE, 7(7), e39464.

Boyack, K.W. (2009). Using detailed maps of science to identify potential collaborations. Scientometrics, 79(1), 27–44.

Boyack, K.W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404.

Boyack, K.W., & Klavans, R. (2011). Multiple dimensions of journal specificity: Why journals can’t be assigned to disciplines. In 13th International Conference of the International Society for Scientometrics and Informetrics (pp. 123–133).

Boyack, K.W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.

Boyack, K.W., Newman, D., Duhon, R.J., Klavans, R., Patek, M., Biberstine, J.R., et al. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLoS ONE, 6(3), e18029.

Chen, C., Chen, Y., Horowitz, M., Hou, H., Liu, Z., & Pelligrino, D. (2009). Towards an explanatory and computational theory of scientific discovery. Journal of Informetrics, 3, 191–209.

Franklin, J.J., & Johnston, R. (1988). Co-citation bibliometric modeling as a tool for S&T policy and R&D management: Issues, applications, and developments. In A.F.J. van Raan (Ed.), Handbook of quantitative studies of science and technology (pp. 325–389). Amsterdam: Elsevier Science Publishers, B.V.

Garfield, E., Malin, M.V., & Small, H. (1978). Citation data as science indicators. In Y. Elkana, J. Lederberg, R.K. Merton, A. Thackray, & H. Zuckerman (Eds.), Toward a metric of science: The advent of science indicators. New York: John Wiley & Sons.

Griffith, B.C., Small, H.G., Stonehill, J.A., & Dey, S. (1974). Structure of scientific literatures. 2. Toward a macrostructure and microstructure for science. Science Studies, 4(4), 339–365.

682 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—April 2014DOI: 10.1002/asi

Klavans, R., & Boyack, K.W. (2006). Quantitative evaluation of large maps of science. Scientometrics, 68(3), 475–499.

Klavans, R., & Boyack, K.W. (2009). Toward a consensus map of science. Journal of the American Society for Information Science and Technology, 60(3), 455–476.

Klavans, R., & Boyack, K.W. (2010). Toward an objective, reliable and accurate method for measuring research leadership. Scientometrics, 82(3), 539–553.

Klavans, R., & Boyack, K.W. (2011). Using global mapping to create more accurate document-level maps of research fields. Journal of the American Society for Information Science and Technology, 62(1), 1–18.

Klavans, R., Boyack, K.W., & Small, H. (2012). Indicators and precursors of "hot science." In 17th International Conference on Science and Technology Indicators (pp. 475–487). Montreal, Canada.

Leydesdorff, L., & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2), 348–362.

Lin, J. (1991). Divergence measures based on Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.

Lin, J., & Wilbur, W.J. (2007). PubMed related articles: A probabilistic topic-based model for content similarity. BMC Bioinformatics, 8, 423.

Martin, B.R., Nightingale, P., & Yegros-Yegros, A. (2012). Science and technology studies: Exploring the knowledge base. Research Policy, 41(7), 1182–1204.

Martin, S., Brown, W.M., Klavans, R., & Boyack, K.W. (2011). OpenOrd: An open-source toolbox for large graph layout. Proceedings of SPIE—The International Society for Optical Engineering, 7868, 786–806.

Moed, H.F., Glänzel, W., & Schmoch, U. (Eds.). (2004). Handbook of quantitative science and technology research: The use of publication and patent statistics in studies of S&T systems. Dordrecht: Kluwer Academic Publishers.

Morris, S.A., & Martens, B.V. (2008). Mapping research specialties. Annual Review of Information Science and Technology, 42, 213–295.

Rafols, I., Porter, A.L., & Leydesdorff, L. (2010). Science overlay maps: A new tool for research policy and library management. Journal of the American Society for Information Science and Technology, 61(9), 1871–1887.

Small, H. (1999). Visualizing science by citation mapping. Journal of the American Society for Information Science, 50(9), 799–813.

Small, H., Sweeney, E., & Greenlee, E. (1985). Clustering the Science Citation Index using co-citations. II. Mapping science. Scientometrics, 8(5/6), 321–340.

Small, H., Boyack, K.W., & Klavans, R. (2013). Identifying emerging topics by combining direct citation and co-citation. In 14th International Conference of the International Society for Scientometrics and Informetrics. Vienna, Austria.

Spärck Jones, K., Walker, S., & Robertson, S.E. (2000a). A probabilistic model of information retrieval: Development and comparative experiments. Part 1. Information Processing & Management, 36(6), 779–808.

Spärck Jones, K., Walker, S., & Robertson, S.E. (2000b). A probabilistic model of information retrieval: Development and comparative experiments. Part 2. Information Processing & Management, 36(6), 809–840.

van Raan, A.F.J. (2005). Reference-based publication networks with episodic memories. Scientometrics, 63(3), 549–566.

Waltman, L., & van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392.

Waltman, L., Calero-Medina, C., Kosten, J., Noyons, E.C.M., Tijssen, R.J.W., van Eck, N.J., et al. (2012). The Leiden Ranking 2011/2012: Data collection, indicators, and interpretation. Journal of the American Society for Information Science and Technology, 63(12), 2419–2432.

White, H.D., & McCain, K.W. (1997). Visualization of literatures. Annual Review of Information Science and Technology, 32, 99–168.

Zitt, M. (2010). Citing-side normalization of journal impact: A robust variant of the Audience Factor. Journal of Informetrics, 4, 392–406.

Appendix A: Cocitation Model Protocol

This protocol first appeared in Boyack and Klavans (2010). The data from each source publication year are considered separately so that annual models can be built for each year. To create an annual model from a single year's data, we use the following method.

• An age-dependent threshold is applied to identify a set of highly cited references. A relatively low threshold is used; references are included if cited five or more times in a single year. Recent references have a lower threshold: those published within 3 years must have at least age + 1 cites, whereas those of age 0 must have been cited twice. This threshold maximizes the number of cited references that are retained in the calculation, subject to the current limits of our clustering approach.
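The age-dependent rule above can be sketched as a small helper. The function names are ours, and the exact rule follows the text as literally as possible (ages 1–3 need age + 1 cites; age 0 needs 2; older references need 5):

```python
def citation_threshold(age: int) -> int:
    """Minimum same-year citation count for a reference of a given age.

    Rule as described in the text: references older than 3 years need
    5+ cites; those aged 1-3 years need age + 1 cites; brand-new
    references (age 0) need 2 cites.
    """
    if age == 0:
        return 2
    if age <= 3:
        return age + 1
    return 5


def is_highly_cited(age: int, cites: int) -> bool:
    """True if a reference clears the age-dependent threshold."""
    return cites >= citation_threshold(age)
```

Under this reading, the threshold rises smoothly from 2 cites for brand-new references to the flat 5-cite floor for older ones.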

• Cocitation frequencies, $C_{i,j}$, between pairs of reference documents i and j were calculated from the citing:cited pairs list.

• Each cocitation frequency was modified using

$$F_{i,j} = \frac{1}{\log_{10}\left[\,p\left(C_{i,j}+1\right)\right]}, \qquad \text{(A1)}$$

where $p(C_{i,j}+1) = C_{i,j}\,(C_{i,j}+1)/2$.

• K50 (modified cosine) values were calculated from each $F_{i,j}$ value as:

$$K50_{i,j} = K50_{j,i} = \max\left[\frac{F_{i,j}-E_{i,j}}{\sqrt{S_i\,S_j}},\ \frac{F_{j,i}-E_{j,i}}{\sqrt{S_j\,S_i}}\right], \qquad \text{(A2)}$$

where $E_{i,j} = \dfrac{S_i\,S_j}{SS - S_i}$, $\;S_i = \displaystyle\sum_{j=1,\,j\neq i}^{n} F_{i,j}$, and $\;SS = \displaystyle\sum_{i=1}^{n} S_i$.

$E$ is an expected value of $F$ and varies with $S_j$; K50 differs from most other measures in that it is a relative measure that subtracts out the expected value. Thus K50 will be positive only for those reference paper interactions that are larger than expected given the matrix row and column sums. Note also that, although $E_{i,j} \neq E_{j,i}$, the differences in these values are typically too small to be of consequence for article-level similarities, even though they can be quite large for journal-level similarities.
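A minimal sketch of Eqs. A1 and A2 in code, under our reading of the appendix (log base 10 is assumed; the function and variable names are ours, and F is taken to be a small symmetric matrix with zero diagonal):

```python
import math

def modified_freq(c: int) -> float:
    """Modified cocitation frequency (Eq. A1): F = 1 / log10(p(C+1)),
    with p(C+1) = C(C+1)/2. As literally stated, this is undefined for
    C = 1 (log10(1) = 0); pairs below the cocitation threshold are
    assumed to have been filtered out upstream."""
    p = c * (c + 1) / 2
    return 1.0 / math.log10(p)

def k50(F, i, j):
    """K50 (Eq. A2): a cosine-like measure with the expected value E
    subtracted out before normalizing."""
    n = len(F)
    # Row sums S_i (diagonal excluded) and grand sum SS.
    S = [sum(row[b] for b in range(n) if b != a) for a, row in enumerate(F)]
    SS = sum(S)

    def term(a, b):
        E = S[a] * S[b] / (SS - S[a])  # expected value of F[a][b]
        return (F[a][b] - E) / math.sqrt(S[a] * S[b])

    # E is asymmetric, so K50 takes the larger of the two directions.
    return max(term(i, j), term(j, i))
```

Because the expected value is subtracted, K50 is positive only for pairs cocited more than their row and column totals would predict, exactly as the text describes.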

• Given the current practical limits of the OpenOrd layout algorithm, we filter the relatedness matrix, keeping only the top-n most-related references (and their relatedness values) for each reference. We vary the n in top-n from 5 to 15, and scale its value on log(degree), where degree is the number of other references to which the reference is linked through cocitation. The result of this step is a file with pairs of reference papers and their similarity values.
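The text does not give the exact log(degree) scaling, so the sketch below assumes a simple linear interpolation in log(degree) between 5 and 15; the cap on degree and all names are our own illustrative choices:

```python
import math

def top_n_for_degree(degree: int, n_min: int = 5, n_max: int = 15,
                     max_degree: int = 100000) -> int:
    """Choose n (edges kept per reference) on a log(degree) scale.

    Assumption: linear interpolation between n_min and n_max in
    log(degree), saturating at max_degree. Only one plausible reading
    of "scale its value on log(degree)"."""
    if degree <= 1:
        return n_min
    frac = min(1.0, math.log(degree) / math.log(max_degree))
    return round(n_min + frac * (n_max - n_min))

def filter_top_n(similarities):
    """Keep only the top-n most similar partners for each reference.

    `similarities` maps ref -> {other_ref: similarity}."""
    kept = {}
    for ref, partners in similarities.items():
        n = top_n_for_degree(len(partners))
        ranked = sorted(partners.items(), key=lambda kv: kv[1], reverse=True)
        kept[ref] = dict(ranked[:n])
    return kept
```

The effect is that sparsely connected references keep nearly all of their edges, while hub-like references are pruned back hard, which keeps the edge file within the layout algorithm's practical limits.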

The reference clustering process comprises the following steps:

1. The OpenOrd (formerly DrL) graph layout routine (Martin et al., 2011) was run using a similarity file as input and using a cutting parameter of 0.975 (maximum cutting). DrL uses a random walk routine and prunes edges based on degree and edge distance; long edges between nodes of high degree are preferentially cut. A typical DrL run using an input file of 2M articles and 15M edges will cut approximately 60% of the input edges, where an edge represents a single document–document similarity pair from the original similarity file. At the end of the layout calculation, each article has an x,y position, and roughly 40% of the original edges remain.

2. Papers were assigned to clusters using an average-linkage clustering algorithm (Klavans & Boyack, 2006). The average-linkage clustering algorithm uses the article positions (x,y) and remaining edges to assign papers to clusters. Once the clusters are generated, the full list of pairs of papers that co-occur in a cluster is generated for each solution. For example, if papers A, B, C, and D are in a cluster together, the set of pairs will be AB, AC, AD, BC, BD, and CD.

3. Steps 1 and 2 were run 10 separate times using 10 different random starting seeds for DrL, giving rise to 10 unique cluster solutions for the same similarity file. Different starting seeds (i.e., different starting points for the random walk) will give rise to different graph layouts and different (but typically highly overlapping) sets of remaining edges. We use these differences to our advantage in this clustering process.

4. Those paper pairs that appear in six or more of the 10 DrL solutions are considered to be the robust pairs and are listed in a separate file. This list of pairs is then used as the input edges to the same average-linkage clustering algorithm used in the previous step. With this input, the algorithm essentially finds and outputs all distinct graph components. Each separate component is a cluster, and these clusters are referred to as level 0 clusters.
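Since the clustering step at this point "essentially finds and outputs all distinct graph components," a vote count over the 10 solutions followed by a union-find over the robust pairs is a fair stand-in for it. A sketch with our own names:

```python
from collections import Counter

def robust_pairs(solutions, min_votes=6):
    """Keep paper pairs that appear in at least min_votes of the
    cluster solutions. `solutions` is a list of sets of
    (paper, paper) pairs, one set per DrL run."""
    counts = Counter(pair for sol in solutions for pair in sol)
    return {pair for pair, c in counts.items() if c >= min_votes}

def connected_components(pairs):
    """Group papers into level 0 clusters: the connected components
    of the robust-pair graph, via union-find with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```

Pairs that survive in only a few of the 10 layouts are discarded as noise, so each resulting component contains only papers tied together consistently across runs.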

5. Logic dictates that a cluster should have a minimum size; otherwise, there is not enough content to differentiate it from other clusters. In our experience, a cluster should contain a minimum of approximately five papers per year to be considered topical. Thus we take all clusters with fewer than 5 papers and aggregate them. This is done by calculating K50 similarities between all pairs of level 0 clusters and then aggregating each small cluster (<5 papers) with the cluster to which it has the largest K50 similarity until no clusters with <5 papers remain. K50 values are calculated from aggregated modified frequency values (the 1/log[p(C + 1)] values) where available and from the aggregated top-n similarity values in all other cases. The resulting aggregated clusters are known as level 1 clusters.
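Step 5's aggregation loop might look like the following sketch. The `sim` callback stands in for the K50 similarities described in the text, and the merge order (smallest cluster first) is our own simplification:

```python
def aggregate_small(clusters, sim, min_size=5):
    """Merge clusters smaller than min_size into their most-similar
    neighbor until none remain below the size floor.

    `clusters` maps cluster id -> set of papers; `sim(a, b)` returns a
    similarity between cluster ids (a stand-in for K50 values)."""
    clusters = {k: set(v) for k, v in clusters.items()}  # defensive copy
    while True:
        small = [k for k, v in clusters.items() if len(v) < min_size]
        if not small or len(clusters) == 1:
            break
        # Assumption: merge the smallest cluster first.
        k = min(small, key=lambda c: len(clusters[c]))
        best = max((c for c in clusters if c != k), key=lambda c: sim(k, c))
        clusters[best] |= clusters.pop(k)
    return clusters
```

Because merged clusters can themselves still be under the floor, the loop repeats until every surviving (level 1) cluster meets the minimum size.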

Annual models are then linked into a longitudinal model of science by linking clusters of documents from adjacent years together using overlaps in the cited references belonging to each cluster. For each pair of years, linking is done using the superset of references from the 2 years' models. Typically, one third of the references are present in both models, one third are present only in the first year's model, and the other one third are present only in the second year's model. The majority of the references that are in only one model are missing from the other only because they did not meet the citation threshold and can be easily added to the other model using their reference lists. This process generates augmented reference lists for each model. For a given model, the augmented reference list used for linking to the prior year's model will be somewhat different from the augmented reference list used for linking to the subsequent year's model.

Using these augmented reference lists, clusters from adjacent models are linked if a simple cosine index based on the number of overlapping references is above a threshold.

Although it would be nice to set this cosine threshold based on theory, in practice we find that it requires a heuristic approach. If the cosine threshold is set too low, a giant component quickly emerges and the longitudinally linked sets of clusters become so large that they no longer represent research problems. If the cosine threshold is set too high, there is very little linking between clusters. We also found that, given that science is growing and that the number of clusters increases each year, using a single cosine threshold for all pairs of years created less linking for later years (late 2000s) than for earlier years (late 1990s). This was an undesirable effect. To create consistency in the linking patterns over time, we ran linking calculations at a variety of linkage fractions, where linkage fraction is defined as the fraction of clusters in a given year that link to a cluster in the subsequent year. This required calculation of the cosine threshold for each year that would return the desired linkage fraction. For each set of calculations, we examined the average and maximum numbers of forward links per cluster (of those that have forward links), along with the maximum thread size. We chose a linkage fraction of 0.48, meaning that only 48% of the clusters link to a cluster in the subsequent year. This gave us average and maximum numbers of forward links per cluster of 1.1 and 5, respectively. The corresponding forward-linking cosine threshold values for each year are listed in Table 3 and range from 0.260 to 0.223, typically decreasing over time.
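The year-to-year linking described above reduces to a cosine index on the clusters' (augmented) reference sets, compared against a year-specific threshold. A minimal sketch, with our own function names and data shapes:

```python
import math

def ref_cosine(refs_a, refs_b):
    """Cosine index on the overlapping reference sets of two clusters:
    |A ∩ B| / sqrt(|A| * |B|)."""
    overlap = len(refs_a & refs_b)
    return overlap / math.sqrt(len(refs_a) * len(refs_b))

def link_years(year1, year2, threshold):
    """Link clusters in adjacent annual models when the cosine exceeds
    the year-specific threshold (thresholds chosen, per the text, to
    hold the linkage fraction near 0.48).

    `year1` and `year2` map cluster id -> set of cited references."""
    links = []
    for c1, refs1 in year1.items():
        for c2, refs2 in year2.items():
            if ref_cosine(refs1, refs2) > threshold:
                links.append((c1, c2))
    return links
```

In practice one would sweep `threshold` per year pair until the desired fraction of year-1 clusters gains a forward link, rather than fixing it globally.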

When linking clusters in this way over many years, one additional problem can arise: two long threads can be linked together in a later year if the cosine value is high enough, creating artificially large threads. For example, using the method and thresholds detailed above, the largest thread had 872 clusters, or 54 clusters per year. Although this type of linkage can reflect the history of how topics link together from a retrospective point of view, it does not necessarily reflect how each thread grows. We thus implemented an additional step in our threading calculation. For cases in which a single cluster merges two threads, where one thread is at least 8 years old and the other is at least 5 years old, the cluster is assigned to the shorter thread and not to the longer one, despite the fact that the cosine threshold is met in both cases. This criterion still allows very long threads to form, but it does not allow retrospective joining of long existing threads. The effect of this criterion on size was to reduce the largest thread to 66 clusters (or four per year). This is a significant improvement in our view; the majority of the threads remain thin and are not dominated by branching, and thus represent coherent research problems as they move through time.

Appendix B: Bibliographic Coupling Assignment Process

The detailed process for assigning current papers to cocitation clusters is as follows:

1. A bibliographic coupling solution for the current papers was calculated using the methodology from Boyack and Klavans (2010). At this point, there are two solutions: the original cocitation solution, which fractionally assigns papers to clusters (PID CC wt), and the bibliographic coupling solution, which singly assigns papers to clusters (PID BC).

2. The current papers (PID) are divided into three groups:
   a. Group A, those that are in both solutions
   b. Group B, those that are only in the BC solution
   c. Group C, those that are only in the CC solution

3. For all papers in group A, a figure of merit (FOM) based on a combination of the cocitation (CC) and bibliographic coupling (BC) clusters was calculated.
   a. The BC cluster was assigned to each PID in the CC solution to create a table with entries (PID CC BC wt)
   b. Weights were summed by CC:BC pair (CC BC sumccbc)
   c. Weights were summed by BC cluster (BC sumbc) and added to the table in step 3b (CC BC sumccbc sumbc)
   d. Divide sumccbc by sumbc to get the FOM for each CC:BC pair (CC BC FOM)
   e. This FOM replaces the original weights for each PID within a CC:BC pair (PID CC BC FOM)
   f. FOMs are summed and normalized so that they sum to 1.0 for each PID (PID CC FOMnorm). These FOMnorm become the new weights for PID in the cocitation clusters, replacing the old weights.
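Steps 3a–3f for group A can be sketched as a single pass over the fractional cocitation assignments. The function name and data shapes are ours:

```python
from collections import defaultdict

def fom_weights(cc_assignments, bc_assignments):
    """Figure-of-merit reweighting for papers in both solutions (group A).

    cc_assignments: list of (pid, cc, wt) fractional cocitation rows.
    bc_assignments: dict pid -> bc (single bibliographic-coupling cluster).
    Returns dict pid -> {cc: normalized FOM weight}."""
    # Steps 3a-3c: attach the BC cluster to each row, then sum weights
    # per CC:BC pair and per BC cluster.
    sumccbc = defaultdict(float)
    sumbc = defaultdict(float)
    rows = []
    for pid, cc, wt in cc_assignments:
        if pid not in bc_assignments:
            continue  # group C papers keep their cocitation weights
        bc = bc_assignments[pid]
        rows.append((pid, cc, bc))
        sumccbc[(cc, bc)] += wt
        sumbc[bc] += wt
    # Step 3d-3e: FOM = sumccbc / sumbc replaces the original weights.
    new = defaultdict(dict)
    for pid, cc, bc in rows:
        new[pid][cc] = sumccbc[(cc, bc)] / sumbc[bc]
    # Step 3f: normalize so each paper's weights sum to 1.0.
    for pid, ccs in new.items():
        total = sum(ccs.values())
        for cc in ccs:
            ccs[cc] /= total
    return dict(new)
```

The intuition is that a paper's cocitation weight for cluster CC is boosted when many of its bibliographic-coupling neighbors also sit in CC, and dampened otherwise.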

4. Bibliographic coupling clusters some papers that are not clustered by the cocitation solution. These papers (group B) were assigned to cocitation clusters by
   a. Creating a CC:BC cosine relatedness matrix from the FOM in step 3d, where the matrix values are cos = FOM/sqrt(rowsum · columnsum)
   b. Linking PID to CC using their BC clusters (PID CC BC cos)
   c. Singly assigning the PID to the CC cluster with the highest cosine value (PID CC 1.0)

5. Cocitation clusters some papers that are not clustered by the bibliographic coupling solution (group C). The simple fractional assignments from cocitation analysis are used for these papers.
