Category: dataset
Partnership Announcement: Cypris and CORE
We’re delighted to announce a new partnership between CORE and Cypris, a leading AI-driven market intelligence platform that connects research & development (R&D) teams with innovation data and trends in their field.
The partnership will provide Cypris with unlimited access to over 210 million open access articles to further enhance their platform and regularly add live market data to provide R&D teams with the most up-to-date research in their fields of interest.
Continue reading this news on the Jisc Research Blog. read more...
15th October 2021 · CORE Admin · Categories: API, CORE, dataset, newsletter · Tags: core, core integration API, coreAPI, database, dataset, metadata, openaccess
Flowcite Expands its Knowledge Library with 210 Million Research Papers from CORE
Flowcite has teamed up with CORE, the world’s largest aggregator of open access research papers. The partnership will provide Flowcite users with free and unlimited access to millions of open access research papers from the CORE database.
“CORE is delighted to partner with Flowcite and progress our aligned goals to make open research content available to all. By connecting our innovative solutions we continue to evolve the way research is being completed and increase the discoverability and usage of all research outputs,” says Dr Petr Knoth, CORE Founder. read more...
9th April 2021 · CORE Admin · Categories: CORE, dataset, Integration · Tags: core, database, dataset, integration
CORE raises repository data quality by consolidating information from external datasets
Read about how we go beyond merely mirroring content from our data providers in order to improve data quality. In our latest blog post, we present how we link CORE data to complementary scholarly sources and databases, including Crossref, MAG and ORCID.
25th February 2020 · CORE Admin · Categories: aggregation, CORE, dataset, growth, harvesting · Tags: core, data, dataset, enrichment
CORE becomes the world’s largest open access aggregator (or how about them stats 2018 edition)
This was another productive year for the CORE team; our content providers have increased, along with our metadata and full-text records. This makes CORE the world’s largest open access aggregator. More specifically, over the last 3 months CORE had more than 25 million users, tripling our usage compared to 2017. According to… read more...
18th December 2018 · Matteo Cancellieri · Categories: aggregation, API, CORE, dataset, growth, harvesting, recommender, repositories · Tags: 2018, statistics
Increasing the Speed of Harvesting with On Demand Resource Dumps
 
I am currently working with Martin Klein, Matteo Cancellieri and Herbert Van de Sompel on a project funded by the European Open Science Cloud Pilot that aims to test and benchmark ResourceSync against OAI-PMH in a range of scenarios. The objective is to perform a quantitative evaluation that could then be used as evidence to convince data providers to adopt ResourceSync. During this work, we have encountered a problem related to the scalability of ResourceSync and developed a solution to it in the form of an On Demand Resource Dump. The aim of this blog post is to explain the problem, how we arrived at the solution and how the solution works.
The problem
One of the scenarios we have been exploring deals with a situation where the resources to be synchronised are metadata files of a small data size (typically from a few bytes to several kilobytes). Coincidentally, this scenario is very common for metadata in repositories of academic manuscripts, research data (e.g. descriptions of images), cultural heritage, etc.
The problem is that while most OAI-PMH implementations typically deliver 100-1,000 records per HTTP request, ResourceSync is designed in a way that requires resolving each resource individually. We have identified, and confirmed by testing, that for repositories with large numbers of metadata items this can have a very significant impact on the performance of harvesting, as the overhead of the HTTP request is considerable compared to the size of the metadata record.
More specifically, we have run tests over a sample of 357 repositories. The results show that while the speed of OAI-PMH harvesting ranges from 30 to 520 metadata records per second, depending largely on the repository platform, harvesting the same content with existing ResourceSync client/server implementations and a sequential downloading strategy achieves only around 4 metadata records per second. We are preparing a paper on this, so I am not going to disclose the exact details of the analysis at this stage.
As ResourceSync has been created to overcome many of the problems of OAI-PMH, such as:
  • being too flexible in terms of support for incremental harvesting, resulting in inconsistent implementations of this feature across data providers,
  • some of its implementations being unstable and less suitable for exchanging large quantities of metadata and
  • being only designed for metadata transfer, omitting the much needed support for content exchange
it is important that ResourceSync performs well under all common scenarios, including the one we are dealing with.
Can Resource Dumps be the solution?
An obvious option for solving the problem, already offered by ResourceSync, is the Resource Dump. While a Resource Dump can speed up harvesting to levels far exceeding those of OAI-PMH, it creates considerable extra complexity on the side of the server. The key problem is that the data must be periodically packaged as a Resource Dump, which basically means running a batch process to produce a compressed (zip) file containing the resources.
The number of Resource Dumps a source needs to maintain is equal to the number of Capability Lists it maintains times the size of the Resource Dump Index. The minimum practical operational size of a Resource Dump Index is 2; this ensures we do not remove a dump that is currently being downloaded by a client while a new dump is being created. As we have observed that a typical repository may contain about 250 OAI-PMH sets (Capability Lists in ResourceSync terminology), a source that chooses to use Resource Dumps as part of the harvesting process would need to maintain and periodically regenerate around 500 dumps, which implies significant data duplication.
On Demand Resource Dumps
To deal with the problem, we suggest an extension of ResourceSync that supports the concept of an On Demand Resource Dump. An On Demand Resource Dump is a Resource Dump which is created, as the name suggests, whenever a client asks for it. More specifically, a client can scan through the list of resources presented in a Resource List or a Change List (without resolving them individually) and request that the source package any set of these resources as a Resource Dump. This approach speeds up harvesting and saves processing on the side of both the source and the client. Our initial tests show that it enables ResourceSync to perform as well as OAI-PMH in the metadata-only harvesting scenario when requests are sent sequentially (the most extreme scenario for ResourceSync). However, as ResourceSync requests can be parallelised, unlike OAI-PMH requests (which depend on the resumption token), this makes ResourceSync a clear winner.
In the rest of this post, I will explain how this works and how it could be integrated with the ResourceSync specification.
There are basically 3 steps:
  1. defining that the server supports an on-demand Resource Dump,
  2. sending a POST request to the on-demand dump endpoint and
  3. receiving a response from the server that 100% conforms to the Resource Dump specification.
I will first introduce steps 2 and 3 and then I will come back to step 1.
Step 2: sending a POST request to the On Demand dump endpoint
We have defined an endpoint at https://core.ac.uk/datadump. You can POST it a list of resource identifiers (which can be discovered in a Resource List). In the example below, I am using curl to send it a list of resource identifiers in JSON that I want resolved. The approach is not limited to JSON; it can be used for any resource listed in a Resource List, regardless of its type. Try it by executing the code below in your terminal.
curl -d '["https://core.ac.uk/api-v2/articles/get/42138752","https://core.ac.uk/api-v2/articles/get/32050"]' -H "Content-Type: application/json" https://core.ac.uk/datadump -X POST > on-demand-resource-dump.zip
read more...
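If you prefer a scripted client to curl, here is a minimal Python sketch of the same request using the requests library. It assumes the endpoint behaves exactly as in the curl example above: a JSON array of resource identifiers in, a zip file conforming to the Resource Dump specification out.

import requests

# The two article URLs from the curl example above; any resources listed
# in a Resource List could be used here instead.
resource_ids = [
    "https://core.ac.uk/api-v2/articles/get/42138752",
    "https://core.ac.uk/api-v2/articles/get/32050",
]

# POST the identifiers as JSON; requests sets the Content-Type header for us.
response = requests.post("https://core.ac.uk/datadump", json=resource_ids, timeout=300)
response.raise_for_status()

# Save the returned zip file, exactly as the curl example does.
with open("on-demand-resource-dump.zip", "wb") as f:
    f.write(response.content)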
17th March 2018 · Petr Knoth · Categories: CORE, dataset, repositories · Tags: fast sync, harvesting, open access, repositories, resource dumps
CORE’s Open Access content has reached the Moon! (or how about them stats 2017 edition)
For yet another year (see previous years: 2016, 2015) CORE has been really productive; the number of our content providers has increased and we now have more open access full-text and metadata records than ever.
Our services are also growing steadily and we would like to thank the community for using the CORE API and CORE Datasets.
We also offer other services, such as the CORE Repositories Dashboard, the CORE Publisher Connector and the CORE Recommender. We received great feedback with regard to the CORE Recommender, including from George Macgregor, Institutional Repository Manager at Strathclyde University.
We are thrilled that this year CORE made it to the moon. Our next destination is Venus.
The CORE Team wishes you Merry Christmas and a Prosperous New Year!
* Note: Special thanks to Matteo Cancellieri for creating the graphics in this blog post.
21st December 2017 · Matteo Cancellieri · Categories: aggregation, API, CORE, dataset, growth, harvesting, recommender, repositories · Tags: 2017, statistics
CORE now offers 5 million open access full-text research papers
CORE is thrilled to announce that it currently provides 5 million open access full-text papers.
CORE’s data providers from around the world
“In the last year, we have managed to scale up our harvesting process. This enabled us to significantly increase the amount of open access content we can offer to our users. With more and more open access content being made available by data providers, thanks to recent open access policies, CORE now also captures and provides access to a higher percentage of global research literature,” says CORE’s founder, Dr Petr Knoth.
With 66 million metadata records and 5 million full-text records, from 102 countries and in 52 different languages, CORE is now the world’s largest full-text open access aggregator. CORE embraces the vibrant collections of both institutional and disciplinary repositories, and its large volume of scholarly outputs ranges from scientific research papers to grey literature, and from Master’s to doctoral theses. In addition, it serves as a metasearch for the peer-reviewed scientific articles published in open access journals. read more...
3rd February 2017 · CORE Admin · Categories: aggregation, CORE, dataset, growth, harvesting, repositories · Tags: full-text, metadata, statistics, tdm
CORE’s open access and text mining services – 2016 growth (or, how about them stats – 2016 edition)
The past year has been productive for the CORE team; the number of harvested repositories and our open access content, both in metadata and full-text, has massively increased. (You can see last year’s blog post with our 2015 achievements in numbers here.)
There was also progress with regard to our services; the number of our API users almost doubled in 2016, we now have about 200 registered CORE Dashboard users, and this past October we released a new version of our recommender and updated our dataset.
Around this time of year, the joyful Christmas spirit of the CORE team increases along with our numbers. Thus, we decided to recalculate how far the CORE research outputs would reach towards the moon if we had printed them (last year we made it a third of the way).
We are thrilled to see that this year we got CORE even closer to the moon! We would also like to thank all our data providers, who have helped us reach this goal.
Fear not: we will never actually print all our research outputs; we believe their mission is to be discoverable on the web as open access. Plus, we love trees.
Merry Christmas from the CORE Team!
* Note: Special thanks to Matteo Cancellieri for creating the CORE graphics.
 
19th December 2016 · nancypontika · Categories: aggregation, CORE, dataset, growth, harvesting, recommender, repositories · Tags: open access, statistics, text mining
Analysing ORCID coverage across repositories through CORE
* This post was authored by Matteo Cancellieri, Petr Knoth and Nancy Pontika.
Last month, CORE attended the Jisc ORCID hackday events in Birmingham and London. (ORCID is a non-profit organisation that aims to solve the author disambiguation problem by offering unique author identifiers.) Following the discussions sparked at the two events, we decided to test the CORE data against ORCID’s API, and we discovered some information that we think is of interest to the scholarly community.
Currently, CORE has data for 5.5 million unique Digital Object Identifiers (DOIs) linked to records in our database (both metadata-only and full text). Based on this number, we wanted to find out how many of these DOIs were connected to an ORCID ID. We therefore set up a script that called the ORCID API while obeying its rate limit, and in around 7 days we had collected the full results.
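For illustration, below is a minimal Python sketch of the kind of DOI-to-ORCID lookup the script performed. The endpoint, query field and response shape shown here are assumptions about ORCID’s public search API rather than a record of the exact script we ran, so treat it as a starting point only.

import time
import requests

# Assumed ORCID public search endpoint; check the ORCID API documentation
# before relying on it.
ORCID_SEARCH = "https://pub.orcid.org/v3.0/search/"

def orcids_for_doi(doi):
    """Return the ORCID IDs associated with works carrying the given DOI."""
    resp = requests.get(
        ORCID_SEARCH,
        params={"q": 'doi-self:"{}"'.format(doi)},  # assumed search field name
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("result") or []      # assumed response envelope
    return [r["orcid-identifier"]["path"] for r in results]

def crawl(dois, delay=0.5):
    """Query one DOI at a time, sleeping between calls to respect the rate limit."""
    mapping = {}
    for doi in dois:
        mapping[doi] = orcids_for_doi(doi)
        time.sleep(delay)
    return mapping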
Of the 5,523,577 articles with a DOI in the CORE collection, we discovered that 196,713 different authors had an ORCID ID and that 927,645 articles included at least one ORCID ID.
We found that 16% of the DOIs in CORE are connected to at least one author registered in ORCID. The following map shows the distribution of the ORCID IDs discovered across the world. (Credit to Aristotelis Charalampous for helping us generate this map visualisation.)
Why is this useful? It enables us to assess ORCID’s coverage across a large multidisciplinary dataset of open access papers. With some more digging into the data (which we haven’t done yet), it would also be possible to analyse the growth of ORCID over time. These data can be sliced and diced according to various criteria, such as geographical coverage or repository, to understand how ORCID coverage can be improved.
Based on our results, the UK has the largest number of ORCID IDs. However, this result is somewhat skewed by the fact that CORE has excellent content coverage across UK repositories.
We also tried to find authors with ORCID IDs who deposited content in one of the UK repositories. Our results indicate that 68,849 ORCID IDs were discovered from 254,467 unique DOIs. We then looked at the distribution of ORCID IDs across the top 15 UK repositories. This analysis can be extremely helpful in identifying repositories with low ORCID coverage and encouraging them to take appropriate action.
Repositories implementing RIOXX already have the ability to expose ORCID IDs through an attribute of the rioxxterms:author tag. While this opportunity exists, our quick survey showed that only a few repositories supporting RIOXX have implemented it. Thanks to John Salter, Software Developer at Leeds University, for his help in collecting the data and creating the chart. John is currently working on including ORCID IDs in the White Rose repository, which “forced” us 🙂 to use a log scale in the chart due to the widespread implementation of the ID attribute in their metadata (more than 6,000 ORCID IDs versus fewer than 10 from the other repositories).
Dataset
We have made the dataset available online on GitHub; it can be found here.
There are a few caveats in the data that must be taken into consideration. Our main challenge was that some of the aggregated DOIs were not valid or pointed to a journal instead of a paper. The ORCID API returned only partial matches, and in the case of journal DOIs this meant that the results covered all the authors of a journal rather than the authors of one specific paper.
What next?
In this preliminary study we realised that the information we extracted from the data was useful to us and could perhaps also be useful to repository managers. Our plan is to design and implement new functionality in the CORE Repositories Dashboard. We are planning to submit a proposal for this to OR2017 and we would really appreciate your feedback. If you are a repository manager and you want to know more, please contact us.
21st October 2016 · Matteo Cancellieri · Categories: CORE, dataset, repositories, Uncategorised
CORE released a new Dataset
We are pleased to announce that we have released a new version of our dataset, which contains data aggregated by CORE in a downloadable file.
It is intended for (possibly computationally intensive) data analysis. Here you can read the dataset description and visit the download page. If you need fresh data and your requirements are not computationally intensive, you can also use our API.
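As a rough illustration of the API route, the Python sketch below fetches a single record from the v2 articles endpoint that appears elsewhere on this blog. The apiKey parameter name and the response fields are assumptions, so please check the API documentation for the authoritative details.

import requests

API_KEY = "YOUR_API_KEY"      # hypothetical placeholder; register for a real key
ARTICLE_ID = 42138752         # article id taken from the Resource Dump curl example

resp = requests.get(
    "https://core.ac.uk/api-v2/articles/get/{}".format(ARTICLE_ID),
    params={"apiKey": API_KEY},  # assumed authentication parameter
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# Print the status flag and the article title, assuming the usual
# {"status": ..., "data": {...}} envelope.
print(record.get("status"), record.get("data", {}).get("title"))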
13th October 2016 · CORE Admin · Categories: CORE, dataset, harvesting · Tags: API, core, dataset, harvesting, open access, text mining
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 4.0 International license.