Introduction
This page describes the training/test split of the Reuters RCV1 and RCV2 corpora.
Split
All documents from 1996 were used for training, and all documents from 1997 were candidates for the test set.
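As a minimal sketch of the year-based split, assuming each document is available as a pair of an id and a publication date (the function name `split_by_year` and the input layout are hypothetical, not part of the project):

```python
from datetime import date

def split_by_year(documents):
    """Split (doc_id, published) pairs into training ids and test candidates.

    Assumption: `published` is a datetime.date. Documents from 1996 go to
    training, documents from 1997 become test candidates, and anything
    else is ignored.
    """
    train_ids = []
    test_candidate_ids = []
    for doc_id, published in documents:
        if published.year == 1996:
            train_ids.append(doc_id)
        elif published.year == 1997:
            test_candidate_ids.append(doc_id)
    return train_ids, test_candidate_ids
```

The 1997 documents are returned only as candidates; how the final test documents are chosen from them is not specified here, so the sketch leaves that step out.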
Preprocessing
The Reuters documents contain a lot of information that could serve as indicators of document boundaries. All such traces were therefore removed when preparing the documents.
- Iterate over all paragraphs
- Remove whitespace from the beginning and end of each paragraph
- Remove all paragraphs that start with "(c) Reuters"
- Remove " (Reuter) - " from the first sentence
- Remove all paragraphs that match the Author Pattern: "(\p{L}{3}+\/)*\p{L}{3}+\."
- Iterate over all sentences
- Remove whitespace from the beginning and end of each sentence
- Remove all sentences that start with "(c) Reuters"
- Remove all sentences that match the Author Pattern
- Remove all sentences that contain more than 10 digits and fewer than 100 characters
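The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation: the function names are hypothetical, sentence splitting is assumed to happen elsewhere, "match" is read as a full match, and because Python's `re` module has no `\p{L}`, the character class `[^\W\d_]` stands in for "any letter".

```python
import re

# Approximation of the Author Pattern. [^\W\d_] replaces \p{L}, and the
# possessive {3}+ is written as {3}, which is equivalent for a fixed count.
LETTER = r"[^\W\d_]"
AUTHOR_PATTERN = re.compile(rf"(?:{LETTER}{{3}}/)*{LETTER}{{3}}\.")

def is_author_line(text):
    """Return True if the whole text matches the Author Pattern."""
    return AUTHOR_PATTERN.fullmatch(text) is not None

def clean_paragraphs(paragraphs):
    """Apply the paragraph-level steps to a list of paragraph strings."""
    cleaned = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if paragraph.startswith("(c) Reuters"):
            continue
        if is_author_line(paragraph):
            continue
        cleaned.append(paragraph)
    if cleaned:
        # Assumption: the "first sentence" lives in the first kept paragraph.
        # The marker is removed literally; no replacement text is specified.
        cleaned[0] = cleaned[0].replace(" (Reuter) - ", "", 1)
    return cleaned

def keep_sentence(sentence):
    """Return False for sentences that the sentence-level steps would remove."""
    sentence = sentence.strip()
    if sentence.startswith("(c) Reuters"):
        return False
    if is_author_line(sentence):
        return False
    digit_count = sum(ch.isdigit() for ch in sentence)
    # Short, digit-heavy sentences (e.g. phone numbers) are dropped.
    if digit_count > 10 and len(sentence) < 100:
        return False
    return True
```

A caller would typically run `clean_paragraphs` first, split the remaining paragraphs into sentences with a tokenizer of its choice, and then keep only the stripped sentences for which `keep_sentence` returns True.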
Download
The split, together with the IDs of the documents used for training and testing, is available in the download section of the project.