Identifying Emerging Research Related to Solar Cells Field using a Machine Learning Approach

The number of research papers related to the solar cell field is increasing rapidly. The exponential growth of publications and the field's subdivided knowledge structure make it hard to grasp research trends and to identify emerging research issues. Machine learning techniques can be applied to these enormous amounts of data and subdivided research fields to identify emerging research. This paper proposes a prediction model that uses a machine learning approach to identify emerging academic research related to solar cells, i.e. papers that might be cited very frequently within three years. The proposed model performed well and stably. The model highlighted some articles published in 2015 that are expected to become emerging. Research related to vegetable-based dye-sensitized solar cells was identified by the model as one of the promising research topics. The proposed prediction model is useful for gaining foresight into research trends in science and technology and for facilitating decision-making processes.


INTRODUCTION
Analyzing trends in academic research can be very helpful when determining the direction of technical developments. This is particularly true in a field such as solar photovoltaic power, which uses technologies that have close linkages to scientific knowledge.
Many methods have been applied in various fields to produce technological forecasts by gathering experts and reaching a consensus. Recently, some weaknesses of these methods have been pointed out. One is that the individuals who create the forecast are increasingly dependent on the relevant knowledge-base; committee members could produce a useful forecast on their own. Another issue is the huge amount of related data. Few professionals can grasp a comprehensive picture of the field. The number of related research papers is increasing rapidly, so it is difficult for one person, restricted by time and resource constraints, to absorb the contents of all available papers.
Researchers now need methods that can identify emerging research in advance from the vast amounts of available information. The large amounts of data and finely segmented research fields have necessitated such methods, spurring the development of machine-learning techniques. Among such techniques, some methods have been proposed to identify emerging research efforts that might eventually lead to great advances. Emerging research is research that might develop into remarkable and fruitful research activities, although it may not have been in the spotlight at the time of publication. In this paper, the prediction of emerging research is defined as the advance identification of papers that might be cited very frequently at a later date.
Many earlier works have proposed methods for estimating and predicting emerging fields in science and technology. Winnink and Tijssen demonstrated the predictability of emerging fields in graphene research, which eventually developed into a paper that won a Nobel Prize [1]. Adams reported a correlation between the numbers of citations that arose in the literature 3-10 years after publication of a paper and those 1-2 years after its publication [2]. Goffman and Newill modeled the propagation of information similarly to the spread of plague [3]. Bettencourt et al. described the propagation of new fields using a Susceptible-Infected-Recovered (SIR) model that had been used to simulate a spreading plague [4]. Chen et al. assessed research papers related to structural holes of networks making use of a co-citation network and a joint-research network [5].
Kajikawa et al. collected papers related to solar photovoltaic power generation, constructed a landscape of academic knowledge and demonstrated that the field is divided into four clusters [6]. Lizin et al. described a landscape of academic knowledge related to patent data and compared it with an organic photovoltaic effect [7]. Sakata and Sasaki analyzed the publication trends in the field of solar photovoltaic power generation in several countries; their results showed that Asian countries keep up with global trends [8]. Shibata et al. analyzed bibliographic data from academic papers and patents, and discussed development prediction in fields that had sufficient research papers but few patents [9]. Thus, many reports have described general landscapes and reviews, but there have been no attempts to predict the growth of citations in the field of solar cells. Therefore, we believe that our research is important to the field of solar cells.
Methods for predicting emerging research have been proposed by researchers in bibliometrics or library and information science. However, owing to the increasing influence of "big data", these predictions are currently studied in the fields of computer science, data mining and information retrieval. Li and Tong considered predicting the number of citations as an optimization problem. For 500,000 papers in computer science, that study predicted the number of citations 10 years after publication based on the number of citations during the first 3 years after publication. Their results showed that the number of citations 3 years after publication is a useful predictor of later citations [10]. Dong et al. predicted the h-index of authors 5 years after the publication of their papers. The impact of a paper is defined using six factors: author, content, publisher, citation, co-authors and chronological order. The dataset used for that study included 2 million papers related to computer science [11]. Davletov et al. predicted citations 5 and 10 years after publication based on chronological data of citations a few years after publication, and structural information related to citation networks [12]. They used a dataset of 27,000 arXiv records for papers related to energy physics, 1.5 million AMiner records and 2 million CiteSeerX records related to computer science papers [13][14][15]. Their results show the importance of chronological citation data during the first 2 years after publication [12]. Chakraborty et al. classified chronological information related to the number of citations a few years after publication into six patterns, and predicted the number of citations over 5 years based on the features of authors, academic societies and keywords [16]. 
Their dataset included 1.5 million data records of computer science papers, and their results demonstrated the particular importance of the number of citations of a paper's author and the number of citations 1 year after publication. Wang et al. examined a method that predicted future citations from chronological citations over the 5 years after publication, using the power law. Their dataset included bibliographic data from three journals: Physical Review B, PNAS and Cell. The citations of 90% of the papers matched the predictions for the 25 years after publication [17].
These prediction methods are based on chronological citation data for a few years after publication, particularly the number of citations and the degree of impact. However, our objective is the "early" prediction of emerging research. This research attempts to predict the growth of citations in the near future (3 years after publication) using chronological data for the year after publication. Our method differs from existing techniques in that it uses only topological features such as network indices, without domain-specific information (e.g. keywords). Furthermore, it predicts an increase of citations in the near future using chronological data obtained shortly after publication. The authors extracted structural features at different granularities from large citation networks using clustering analysis. Our model represents a novel early prediction method that integrates structural features from citation networks.

Construction of the prediction model
In this research, academic papers that had the terms "solar cell" or "photovoltaic" in their title, abstract or keywords were extracted from the Thomson Web of Science Core Collection database. Only journal papers related to the field were targeted. For each paper, the title, abstract, author names, year of publication and citation-related information were extracted from the dataset. From the extracted data, a citation network was created for each year, with the cumulative papers as nodes and the cumulative citation relationships as links. From the resulting time-expanded network, features of the following classes were extracted for each paper in each year. The constructed features serve as the learning data for predicting emerging research.
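The yearly cumulative networks described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout (a dictionary of publication years and a list of citing/cited pairs) and the toy paper ids are all assumptions made for the example.

```python
def cumulative_citation_networks(pub_year, citations, years):
    """Return {year: (nodes, edges)} holding all papers and links up to that year.

    pub_year:  dict mapping paper id -> publication year (hypothetical layout)
    citations: iterable of (citing_id, cited_id) pairs
    years:     iterable of snapshot years
    """
    citations = list(citations)  # allow several passes over the pairs
    snapshots = {}
    for year in sorted(years):
        # Nodes are cumulative: every paper published up to this year.
        nodes = {p for p, y in pub_year.items() if y <= year}
        # A link is kept once both of its endpoints have been published.
        edges = {(src, dst) for src, dst in citations
                 if src in nodes and dst in nodes}
        snapshots[year] = (nodes, edges)
    return snapshots


# Toy data with hypothetical paper ids.
pub_year = {"A": 2006, "B": 2007, "C": 2008}
citations = [("B", "A"), ("C", "A"), ("C", "B")]
nets = cumulative_citation_networks(pub_year, citations, [2006, 2007, 2008])
```

Each snapshot can then be passed to whatever feature extraction is used for that year, so that features for a paper never depend on citations made after the snapshot year.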
The features used in the prediction model were categorized into four classes: network, cluster, centrality and properties of citation. The network features represent the general features of the citation network. A cluster is defined as a set of papers that are densely linked by citations in the citation network, extracted by maximizing the modularity [18]. Centrality represents how central the paper is in terms of its position in the citation network. The degree of centrality can be represented using several methods [19][20][21][22][23][24][25]. The citation properties are overall statistical properties: the maximum, minimum, average and sum of a feature over the set of papers that a paper cites. In total, 63 features were used, as presented in Table 1. These features were calculated for all of the papers in the largest connected component and were used as explanatory variables. The model output predicts whether a paper will become emerging.
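As an illustration of the citation-property class, the sketch below aggregates one feature over the papers that each paper cites. The function name, the dictionary inputs and the plain "max", "min", "avg" and "sum" keys are assumptions for the example; they are not the feature identifiers used in Table 1.

```python
def citing_aggregates(feature, references):
    """Aggregate one feature over the set of papers that each paper cites.

    feature:    dict paper id -> numeric value (e.g. a centrality score)
    references: dict paper id -> list of cited paper ids
    Returns dict paper id -> {"max", "min", "avg", "sum"}, or None when the
    paper cites nothing that carries the feature.
    """
    result = {}
    for paper, cited in references.items():
        values = [feature[c] for c in cited if c in feature]
        if not values:
            result[paper] = None
            continue
        total = sum(values)
        result[paper] = {
            "max": max(values),
            "min": min(values),
            "avg": total / len(values),
            "sum": total,
        }
    return result


# Toy example with hypothetical scores.
scores = {"A": 1.0, "B": 3.0}
refs = {"C": ["A", "B"], "A": []}
aggregates = citing_aggregates(scores, refs)
```

Returning None for papers without usable references is only one possible convention; a real pipeline might substitute zeros or drop such papers instead.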
In this paper, emerging research was defined as "papers for which the citation increment 3 years after publication (t0 + 3) is in the top 5% of all papers published in that same year (t0) in the dataset". Based on this definition, a model was constructed that extracts the features of emerging research. For this purpose, the model was trained on papers that were emerging 3 years after publication (t0 + 3) and was applied to data from 4 years later (t0 + 4 = t1). Data published in this year (t1) are referred to as the prediction target year data. To evaluate the performance of this model, the citation numbers from 3 years after the prediction target year (t1 + 3) were used. Figure 1 shows the relationship between the training target period and the prediction target period. For example, if 2012 was the prediction target year (t1), the model requires feature data up to the year 2008 (t0) and the correct data at t0 + 3. This was called the "2008 model". We can apply the "2008 model" to the data for 2012 (t1 = t0 + 4) to calculate our prediction. This prediction model was evaluated by using data from the end of 2015 (t1 + 3). Table 2 shows which data was used for each training and verification step.
Table 1 (excerpt): the citation-property features are the maximum, minimum (CITING_MIN-[feature]), average (CITING_AVG-[feature]) and sum (CITING_SUM-[feature]) of the feature in question over the set of papers that a paper cites.
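The top-5% labeling rule and the year arithmetic (t1 = t0 + 4, labels at t0 + 3, evaluation at t1 + 3) can be written down as a short sketch. The function names, the input dictionary and the tie-breaking behavior at the cut-off are assumptions; the paper does not specify how ties were handled.

```python
import math


def label_emerging(citation_increment, top_fraction=0.05):
    """Flag papers whose citation increment is in the top fraction of a cohort.

    citation_increment: dict paper id -> citations gained by t0 + 3, for
                        papers published in the same year t0.
    Ties at the cut-off are broken by sort order (an arbitrary choice).
    """
    n_top = max(1, math.ceil(len(citation_increment) * top_fraction))
    ranked = sorted(citation_increment, key=citation_increment.get, reverse=True)
    top = set(ranked[:n_top])
    return {paper: paper in top for paper in citation_increment}


def timeline(t1):
    """Years involved for a prediction target year t1, using t1 = t0 + 4."""
    t0 = t1 - 4
    return {"t0": t0, "label_year": t0 + 3, "t1": t1, "evaluation_year": t1 + 3}


# Toy cohort of 100 hypothetical papers with distinct increments.
increments = {f"p{i}": i for i in range(100)}
labels = label_emerging(increments)
plan = timeline(2012)
```

For the 2012 example in the text, `timeline(2012)` gives a training year of 2008, labels taken in 2011 and evaluation in 2015.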
The model was constructed by using a statistical machine learning method. Using knowledge from the data confirmation year, items that became emerging research were treated as positive examples. The authors randomly extracted negative examples equal in number to the positive examples. This process was repeated to generate multiple datasets for each year, which were then used to construct the models. To estimate the model performance, the average performance of the multiple models was calculated for each year. Additionally, 5-fold cross validation was implemented for each model to avoid overfitting.
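The resampling procedure can be sketched as below: negatives are undersampled to match the positives, the draw is repeated to obtain several datasets, and each dataset is split into five folds. The function names, the seeds and the default of eight datasets (the number reported in the model development section) are assumptions of this sketch; the learning algorithm itself is not specified in the text and is therefore omitted.

```python
import random


def balanced_datasets(positives, negatives, n_datasets=8, seed=0):
    """Build several balanced datasets by undersampling the negative class."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_datasets):
        # Draw as many negatives as there are positives, without replacement.
        sampled = rng.sample(negatives, len(positives))
        datasets.append([(p, 1) for p in positives] + [(n, 0) for n in sampled])
    return datasets


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


# Toy example: 5 hypothetical positives, 100 hypothetical negatives.
positives = list(range(5))
negatives = list(range(100, 200))
datasets = balanced_datasets(positives, negatives)
splits = list(k_fold_indices(len(datasets[0])))
```

Averaging the scores of the models trained on each dataset, as described above, reduces the dependence of the result on any single random draw of negatives.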

Evaluation of the prediction model
The F-value was used to evaluate the analytical model. The F-value is an index defined as the harmonic mean of precision and recall. Precision is the fraction of papers predicted as emerging that actually became emerging. Recall is the fraction of actually emerging papers that were predicted as emerging. The F-value is extensively used to evaluate prediction models.
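The definitions above translate directly into a few lines of code. The function name and the use of sets of paper ids are assumptions made for this sketch.

```python
def precision_recall_f(predicted, actual):
    """Compute precision, recall and F-value from two sets of paper ids.

    predicted: set of papers the model flagged as emerging
    actual:    set of papers that actually became emerging
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Toy example with hypothetical paper ids.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

The function returns values between 0 and 1; the F-values reported in this paper appear to be expressed on a 0 to 100 scale, which corresponds to multiplying by 100.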

Prediction by model constructed
In this phase, the input data were papers published between January 1, 2015 and December 31, 2015, and the model determined which of these papers are predicted to be in the top 5% in 2018. The forecasted top 10 papers were examined in this research.

Dataset retrieval and feature creation
Papers that included the terms "solar cell" or "photovoltaic" in the title or keywords were extracted from the Web of Science for the period between January 1, 1900 and December 27, 2015. This resulted in 121,393 papers. The earliest was published in 1906. Figure 2 shows the number of publications after 1900. Growth has been exponential since the 1990s (more than 18,000 reports were published in 2015). In the network formed by direct citations among these papers, 112,430 papers belonged to the largest connected component. The number of annual publications was calculated as shown in Table 1 for papers in this largest connected component. The number of citations for all papers in the network was also calculated.
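Extracting the largest connected component of a direct-citation network can be sketched with a breadth-first search. The sketch assumes that citation direction is ignored when testing connectivity, which the text does not state explicitly; the function name and the toy edges are also hypothetical.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the largest set of papers linked by citations, ignoring direction.

    edges: iterable of (citing_id, cited_id) pairs. Papers without any
    citation link never appear and are therefore excluded.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy network: A-B-C form one component, D-E another.
component = largest_connected_component([("B", "A"), ("C", "B"), ("E", "D")])
```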

Model development
For each year, negative examples equal in number to the positive examples were randomly extracted eight times, yielding eight datasets (one per prediction model). The performance of the results for each year is shown in Table 3. All the F-values exceeded 70, demonstrating stable performance. Table 4 shows the most important features for each model, ordered by descending importance. The feature with the highest weight in Table 4 was PageRank (CNT_PAGER) [21]. It is calculated using an algorithm originally designed to assess the importance of a webpage; applied to academic papers, it evaluates them based on citation properties. This index identifies a paper cited by papers that are themselves frequently cited. Furthermore, it reduces the relative importance of papers whose citations include mutual citations. The next most important feature was degree centrality (CNT_DEGRE) [16]. The more a paper is cited in reference lists, the higher the index. The authority score (CNT_AUTHOR) is high for papers that represent bridges between clusters [22]. This sort of paper could generate new, emerging research. The importance of the CITING_SUM-CL_RANK feature indicates that an increase in the number of clusters that include a paper increases its chance of becoming emerging. The sixth to ninth ranking features are based on features of the papers in a paper's reference list. Table 5 shows how the citations of the top 10 papers from 2012 that were predicted to become emerging had grown by 2014, 3 years later. Papers 1, 3, 4, 6, 7, 8 and 10 in Table 5 were considered emerging in 2014. That is, 70% of the 10 papers listed in Table 5 were in the top 5% for 2014.
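To make the PageRank feature concrete, the sketch below runs a plain power iteration on a citation graph. It is an illustration under stated assumptions rather than the computation used in the paper: the damping factor of 0.85 is the conventional default and is not taken from the text, and the function name, iteration count and toy edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a citation graph.

    edges: list of (citing_id, cited_id) pairs; rank flows from the citing
    paper to the cited paper. damping=0.85 is an assumed conventional value.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread over all papers.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy graph: A is cited by B and C, B is cited by C, C is cited by nobody.
ranks = pagerank([("B", "A"), ("C", "A"), ("C", "B")])
```

In the toy graph, A receives the highest score because it is cited both by C and by B, which is itself cited; this is the behavior described above, where citations from frequently cited papers count for more.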

Prediction for papers published in 2015
Lastly, the papers published in 2015 were input into the prediction model, and the top 10 papers are listed in Table 6.

DISCUSSION
This paper proposed and evaluated a method that predicts whether a published paper will become an emerging one in the next 3 years. Table 5 shows that 70% of the top 10 predictions for 2012 were correct. The proposed model was sufficiently dependable; the F-values fluctuated around 70 for all of the years, and the precision and recall values suggest that the model was accurate.
PageRank was an important predictor; a paper that is cited by frequently cited papers is therefore more likely to become emerging. Furthermore, the importance of degree centrality indicates that a paper citing many papers in its reference list will be cited in years to come. As a result of these mechanisms, many review papers could have been predicted as emerging. However, not all papers with extensive reference lists necessarily develop into emerging research. Determining which papers are very likely to become emerging could facilitate estimates of future research and development trends. Table 6 contains the papers predicted to be the most important publications in 2018. One paper, by Calogero et al., describes vegetable-based dye-sensitized solar cells. Vegetable dyes are sensitizers extracted from algae, flowers and fruit [37]. Vegetable-based dye-sensitized solar cells use these dyes in Dye-Sensitized Solar Cells (DSSC). Kay and Grätzel proposed this idea [47]. Since that paper was published, many researchers have tackled this idea. In fact, as of September 3, 2015, that report had been cited 733 times. This field should be carefully observed.
Perovskite solar cells have gained widespread attention. They are produced from cheap materials using a solution technique, so they are highly likely to be used extensively. Perovskite refers to the crystal structure of calcium titanate (CaTiO3). It was named after the Russian mineralogist Lev Perovski. The first paper related to perovskite photoelectric conversion was published in 2009 [48]. The National Renewable Energy Laboratory (NREL) reported a value of 20.1% as the most efficient perovskite photoelectric conversion on February 17, 2015 [49]. Chueh et al. reviewed the latest developments in solution-processed interfacial layers, which have contributed to a marked improvement in the performance of polymer and perovskite solar cells [42]. Based on the results, we can assume that perovskite photovoltaics will become an emerging research field. Two important journals (Science and Nature) highlighted perovskite photovoltaics as one of the greatest breakthroughs of 2013 [50,51].
At the end of 2018, we will be able to evaluate the predictions in Table 6. Because some of these papers deal with common themes, our method could help decision-making processes. This method becomes useful when private enterprises plan their research and development activities or when central governments make decisions related to science and technology policies.

CONCLUSIONS
This paper proposed a prediction model that uses large amounts of data to identify papers that are likely to become emerging in the solar cell field. The authors succeeded in predicting the growth of citations three years after publication by applying machine learning techniques to information derived from data one year after publication. The goal was to achieve an "early" prediction of emerging fields. Various features that had not been used in existing research were used in the model. The authors used four classes of features: network, cluster, centrality and properties of citation.
The dataset contained papers that included "solar cell" or "photovoltaic" and were extracted from the Web of Science. They were published between January 1, 1900 and December 31, 2015. There were 121,393 papers in the dataset.
We tested the model results for 2007-2012 and found that the F-values were stable and greater than 70. We also forecast the results for 2018 using papers from 2015, and we believe that the forecast provides useful information regarding future solar power technologies.
The model predicted that a paper related to vegetable-based dye-sensitized solar cells would be emerging. Although the ideas in the paper are not new, this type of solar cell is remarkable because of the more efficient conversion. Our model predicted that the citation increment of this paper would be remarkable in 3 years.
The conversion efficiency of perovskite solar cell power generation has increased by a factor of five and is approaching that of silicon-based solar cells (which are currently very extensively used). The cheap production cost of perovskite solar cells means that they are becoming popular. Future developments in perovskite solar cells will be very important. Many of the papers that the model forecasted to be most emerging consider common topics.
Among the papers published in 2015, an authoritative journal publisher has noted all of the forecasted top 10 papers. This demonstrates that the prediction is rational. We should test these predictions and propose guidelines to ascertain future trends, while constructing more stable models.
The rapidly increasing amount of information and complicated knowledge structures mean that it is difficult for private enterprises to manage research and development decisions, and for governments to develop science and technology policies. This model can be used to gain foresight into developing trends in science and technology, facilitating human decision-making processes. The proposed model should be particularly valuable for researchers in sustainability fields, because it processes the huge amounts of information in the field, analyzes it, and extracts papers that are expected to be valuable in the near future.