Update - Science-i

07/22/2023 Published!! https://www.nature.com/articles/s41597-023-02383-w
04/04/2023 First review came back (see below)
02/17/2023 Manuscript sent out to peer-review
02/05/2023 Manuscript submitted to Scientific Data (submitted files available in the “Documents” page)

First review (04/04/2023)

Below is the message from the editor after the first review round. In Documents, you’ll find an Excel file where I list how I plan to respond to each comment. I will get back to the team once I finalize the revision with Jingjing, but please feel free to let me know if you have any suggestions.

———————————-

** Please ensure you delete the link to your author homepage in this e-mail if you wish to forward it to your coauthors **

Manuscript Number: SDATA-23-00161
Manuscript Title: Artificial-intelligence augmented spatial database of planted trees (A-SDPT) in East Asia
Corresponding Author: Professor Liang

Dear Professor Liang,

Your manuscript entitled “Artificial-intelligence augmented spatial database of planted trees (A-SDPT) in East Asia” has now been seen by the referee(s), whose comments are appended below. As you will see, the referees find your Data Descriptor of interest, but they raise a number of issues, which would need to be addressed before this work would be appropriate for publication at Scientific Data.

Based on the recommendation of the handling Editorial Board member, we therefore invite you to revise and resubmit your manuscript, taking into account the points raised. Please supply a detailed point-by-point response to the referees’ comments, describing how you have addressed their concerns, with your revised manuscript.

As a matter of course, we ask that you ensure your manuscript complies with our format requirements explained in full in our Submission Guidelines:
https://www.nature.com/sdata/publish/submission-guidelines/

Editorial Requests

Please note this section contains specific requests that we kindly ask you to respond to, in addition to the reviewers’ and Editorial Board Member’s comments below. Please feel free to include any responses to these requests in your ‘response to reviewers’ document if you do not feel the change will be obvious to the Editor.

* Please ensure that your references conform fully to the Nature style. See the examples at the link below:

https://www.nature.com/sdata/publish/submission-guidelines#refs

* Please perform two standard checks, which we request for all datasets making use of input data from third parties:

1) Please cite all your sources in the reference list – preferably at the dataset level, and/or any relevant publications if more relevant (the data providers often provide guidance on how they wish to be acknowledged). For resources that do not have formal metadata, please quote the URLs in the text when they are mentioned in ()s.

2) Confirm all the sources are openly available – i.e. your re-use or re-distribution is compliant with the terms and conditions/licenses for data sharing (again, the data provider should provide guidance on this).

* Rather than use a table of references in Table S1, please add the references for the literature data to the main reference list. While I appreciate this will make the list longer, citations in supplementary documents aren’t usually tracked or counted by bibliographic databases, whereas a citation in the main list will be typeset and sent to the relevant registry (Crossref).

* I also suggest Table S2 is incorporated into the main text for readers’ ease. Please split this into two tables so it fits (our limit for typesetting is less than 1 A4 page).

* Please add data citations for the Figshare datasets to the reference list using these instructions (https://www.nature.com/sdata/publish/submission-guidelines#data_citations – note that DOI URLs should be used, see Example 1). Please add the reference numbers to wherever the dataset is mentioned in the text – the main position should be the first part of the Data Record in a sentence describing where the data has been deposited, and the Code Availability Statement.

Please use the following link to submit your revised Data Descriptor manuscript:

https://mts-scidata.nature.com/cgi-bin/main.plex?el=A5CZ6CQVq3A2HVwn6I6A9ftdajkM6WjWklGWrkmKEzi2wZ

** This url links to your confidential homepage and associated information about manuscripts you may have submitted or be reviewing for us. If you wish to forward this email to co-authors, please delete the link to your homepage first **

REVISION CHECKLIST:

In order to process your revised manuscript, we will require the following items:

* A point-by-point response to any issues raised by our referees and any editorial requests that require further clarification.

* The revised version of your text as a Word or TeX/LaTeX file with all changes highlighted.

* The final versions of any Supplementary Information files.

* Publication quality figures in the TIFF, EPS, or PDF formats.

* Final accession numbers or DOIs for any deposited datasets (please see any specific instructions for data citation in the Editor’s requests, above)

Any authors who have an ORCID and want it to appear on the final publication should add it to their account before you resubmit your revision. ORCIDs may not be added to our system after a paper has been accepted for publication.

We hope to receive your revised paper within three months. Please contact us if you will need more time.

We look forward to receiving your revised manuscript.

Sincerely,

Dr Guy Jones
Chief Editor
Scientific Data

—————

The Editorial Board Member for this manuscript was Atsushi Kume.

Comments from the Editorial Board Member and Reviewers:

Editorial Board Member (Comments to the Author):

Thank you for the submission of your Data Descriptor entitled, “Artificial-intelligence augmented spatial database of planted trees (A-SDPT) in East Asia” (SDATA-23-00161), to the SCIENTIFIC DATA. Your manuscript has been reviewed by two experts in your field of study. Fortunately, both reviewers appreciated the importance of the data. However, both reviewers raised some important concerns about describing the data content and how the dataset was defined. The nature of the dataset seems to vary in reliability depending on the country covered. Also, I agree that the title does not accurately reflect the content. The reviewers offer many constructive suggestions. So I would like to request the authors to improve the manuscript. I look forward to receiving your revision.

Reviewer #1 (Remarks to the Authors):

This manuscript describes a dataset of planted forests in East Asia with a spatial resolution of approximately 1km. Overall, the manuscript is well-written and provides some merits for using the dataset; however, there are some deficits. The following are comments on the questions for reviewers.
(Experimental Rigour and Technical Data Quality)
– The authors generally produced the data in a technically sound manner. They investigated and compared three machine learning algorithms, and then decided to use random forest (RF) model, which achieved the best performance in the experiment. Although this method is reasonable, they only applied this model to China, DPRK, and small parts of Japan. For other parts, they used the mapping results from other data sources. This procedure affects the consistency of the product and raises the necessity of mapping these countries in this product.
– The validation of the RF model is generally acceptable, but there are some unclear points as pointed out in specific comments below. Although the validation for mapping results only considers the extent of China, this is appropriate given that most of the other parts are not from RF models in this manuscript. They did not follow the good practice of the accuracy assessment of land cover products (as outlined in Olofsson et al. 2014, doi.org/10.1016/j.rse.2014.02.015); however, this is acceptable if it is regarded as model-based inference.
– Although the spatial resolution of the data (approximately 1 km) is not high, the dataset can still be used to address research questions. A major problem is the ambiguous range of years this dataset depicts. This might reduce the utility of this dataset. Spatial coverage is wide.
(Completeness of the Description)
– There are several unclear points that may hinder the reproducibility of the dataset, as raised in the specific comments.
– I found no serious issues for information provided by the authors to reuse this dataset. I found some minor inconsistencies in the dataset and main text, but they do not affect the reusability.
– The authors provide the relevant minimum information, such as spatial resolution, CRS, and mapping years.
(Integrity of the Data Files and Repository Record)
– Although there are several inconsistencies, the data files generally match the descriptions in the data descriptor.
– They deposited the data in the figshare, which is an appropriate data repository.

Major comments:
While I believe the authors have a certain motivation for generating the dataset in this study, the current structure lacks a description of the differences between this dataset and existing ones (e.g. Global extent of Planted trees). What is the advantage of the map with a 1km spatial resolution for unknown years, compared with the other data? Describing the unique characteristics and highlighting the usefulness of this dataset would be beneficial to readers.

The title of the manuscript is misleading, in my view. Before reading the main text, one might assume that the authors have generated a map using deep learning architecture, such as convolutional neural networks. However, this manuscript used (or investigated) machine learning algorithms like random forests and XGBoost. In the context of remote sensing, we do not refer to these as artificial intelligence. Furthermore, the use of the term “augmented” is misleading since this technique is frequently associated with deep learning. It is better to reconsider the title of this manuscript and dataset.

Although the authors have acknowledged this in the Usage Note, the year for this map is ambiguous, which affects the utility of this dataset. It might be better to state this issue earlier in the main text to avoid confusion; as it stands, readers will not learn of it until the end of the manuscript.

In my understanding, the authors focused mainly on the mapping of planted forests in China. Because most of the map results for Japan (nearly 90% of the forest area) and entire country of ROK were derived from other data sources, rather than the models in this study, it is questionable to include these regions in the dataset. In these countries, higher spatial resolution data seems to be available. The authors are recommended to provide the motivation for including these countries in the dataset.

Specific comments:
L51: it is usually difficult to state 1km as high-resolution. It is better to change.
L76: Although poplar is raised as a major species, the mapped results of this dataset indicate there is not so much area. Is this correct?
L103: What is the reason for selecting 1 km as the spatial resolution?
L181: There is no explanation for one of the five forest structure attributes (tree height from Potapov et al. 2020). Please add the explanation.
L184: “plant area index”?
L192: Which data was used for deriving tree height (GEDI L2B or Potapov et al. 2020)?
L241: It is not clear how the authors calculated the 95% CI using only 10 iterations.
L251: I assume the authors tuned the number of variables randomly selected in each split (mtry in randomForest package), not manually limiting the number of predictors, but it is better to clarify.
L263: How did the authors choose the parameters for each RF model? (using the parameter that achieved the highest classification accuracy?)
The model training data, iterations, biome zones, and output are different for each model and this is a little complicated. Please consider generating a table summarizing key information for these models. This is just a suggestion.
L273: Because the RF models predict planted or natural forests, there should be other models or data to identify forest areas. How did the authors map forest area? Although there is a description for training data, this information is lacking for mapping.
L277: I do not think major species (or genera) are fully covered in this study. For example, the authors stated that Chamaecyparis obtusa is a major species in Japan, but Chamaecyparis is not included. I assume this is included in Cryptomeria. I understand that all species/genera cannot be mapped, but such a compromise should be described to support suitable usage of this dataset.
L347: The citation for the vegetation survey in Japan is not correct. The URL is for monitoring sites 1000, not for the vegetation survey. Please correct.

Dataset:
I reviewed the dataset and R script in the data repository. There are several points that are different from the description or not explained in the main text. These should be revised or mentioned in the text.

– There is “taiwan” in the Country column, which is not explained.
– Unfortunately, “myfun()” function defined in the “RF_Training.R” did not work in my environment. Please check whether this is correct.
– In several grids, the lower bound model has larger values than the upper bound model. Although this might be inevitable, it might be better to mention.
– Raster data have a lower spatial resolution.
– In Japan, there is no mapped area for Larix. This is unrealistic since Larix kaempferi is the third dominant planted species in the statistics in Japan. The vegetation survey also contains this species, thus, it must be mapped. The authors should provide additional information for this issue in the main text, maybe explaining the allocation of main species to each genus.

Reviewer #2 (Remarks to the Authors):

This research paper is a groundbreaking contribution to the field of forest research and management. The authors have successfully addressed a significant gap in the existing literature by presenting the first AI-augmented spatial database for East Asia’s planted forests. The A-SDPT database is an invaluable resource for researchers, policymakers, and practitioners alike. By providing reliable records of the geographic distribution and tree species composition of East Asia’s planted forests, the authors have laid the foundation for future work on climate change mitigation, forest conservation, and restoration efforts in the region. While the abstract for the research article offers valuable information and presents the study’s main findings, a few revisions can be made to improve clarity and conciseness. Here are some points to consider:

(Introduction) To contextualize the significance of examining East Asia’s planted forests, it is advisable to briefly refer to similar initiatives undertaken in other regions (e.g., United States, Europe, or other regions) or elucidate the distinctiveness of East Asia with respect to planted forest distribution and species composition.

(Methods) An evaluation of potential trade-offs between the spatial resolution (1 km2) employed in the study and the accuracy of the model outputs is warranted. Consideration should be given to whether a finer or coarser resolution (e.g., Landsat or other higher resolution images) might have been more suitable for the study objectives, as well as the implications of the selected resolution on the final results.

(Methods – Quality-Oriented Data Integration, QODI) It is essential to provide additional information regarding the absence of ground-sourced data specific to the Democratic People’s Republic of Korea (DPRK). Moreover, an explanation as to why the DPRK is not depicted in Figure 3 should be supplied.

(Methods – Ensemble Machine Learning Model) In this study, the ensemble machine learning model was employed, which comprised three candidate machine learning models: Random Forest (RF), Support Vector Machine (SVM), and XGBoost. A thorough justification for the selection of these models and their combination is necessary. Additionally, a brief discussion on alternative models and evaluation metrics that could have been employed, along with the rationale behind their exclusion, should be presented.

(Technical Validation) Regarding the estimation of planted forests, the obtained precision value is relatively low, at 0.63. A comprehensive discussion on the implications of this result, as well as potential causes for the low precision, is warranted.

(Technical Validation) In the section “Estimating Dominant Tree Species,” it is stated that 11 out of 17 genera achieved F1 scores higher than 0.50. To provide a more complete understanding of the model’s performance, it would be beneficial to include context on the remaining six genera. This should encompass an explanation of the possible reasons for their lower F1 scores and suggestions for improvements to the model that could address this issue.

(Technical Validation – Uncertainties) The limitations and uncertainties section should be expanded upon to provide a more comprehensive analysis of potential weaknesses in the study. For instance, discuss the impact of the limited in situ data from Japan, ROK, and DPRK on the model’s performance and generalizability. Additionally, explain how the inability to identify the spatial distribution of monoculture versus
