TextSegFault - Reuters RCV1 & RCV2 Split

Introduction

This page describes how the Reuters RCV1 & RCV2 corpus was split into training and test documents.

Split

All documents from 1996 were used for training, and all documents from 1997 were candidates for the test set.
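
The year-based rule above can be sketched in a few lines. This is only an illustration, not the project's actual code: it assumes each document is available as a hypothetical (id, publication date) pair, and the function name split_by_year is made up for the example.

```python
from datetime import date


def split_by_year(documents):
    """Split (doc_id, published) pairs into training ids and test candidates.

    Assumption: `documents` is an iterable of (doc_id, datetime.date) pairs;
    how the date is obtained from the corpus files is not covered here.
    """
    train_ids = []
    test_candidate_ids = []
    for doc_id, published in documents:
        if published.year == 1996:
            train_ids.append(doc_id)
        elif published.year == 1997:
            # 1997 documents are only candidates; the final test set
            # may be a subset of these.
            test_candidate_ids.append(doc_id)
    return train_ids, test_candidate_ids
```

Documents from any other year are ignored by this sketch, since the description above only mentions 1996 and 1997.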

Preprocessing

The Reuters documents contain a lot of information that could serve as indicators of document boundaries. All such traces were therefore removed when preparing the documents:

Iterate over all paragraphs
  Remove whitespace from the beginning and end of each paragraph
  Remove all paragraphs that start with "(c) Reuters"
  Remove " (Reuter) - " from the first sentence
  Remove all paragraphs that match the Author Pattern: "(\p{L}{3}+\/)*\p{L}{3}+\."
Iterate over all sentences
  Remove whitespace from the beginning and end of each sentence
  Remove all sentences that start with "(c) Reuters"
  Remove all sentences that match the Author Pattern
  Remove all sentences that contain more than 10 digits and fewer than 100 characters
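
The steps above could be implemented roughly as follows. This is a minimal sketch under several assumptions, not the project's actual code. The Author Pattern is translated for Python's standard re module, which does not support \p{L}, so [^\W\d_] is used as an approximation of "any letter". The sentence splitter is a naive, hypothetical stand-in because the original segmenter is not specified. "Match" is taken to mean that the whole paragraph or sentence matches the pattern, the sentence loop is assumed to run inside each paragraph, and the " (Reuter) - " tag is replaced by a single space so that the surrounding words stay separated.

```python
import re

# Approximate Python translation of "(\p{L}{3}+\/)*\p{L}{3}+\."
# [^\W\d_] stands in for \p{L}; the possessive {3}+ behaves like {3} here.
AUTHOR_PATTERN = re.compile(r"(?:[^\W\d_]{3}/)*[^\W\d_]{3}\.")
COPYRIGHT_PREFIX = "(c) Reuters"
REUTER_TAG = " (Reuter) - "


def is_author_line(text):
    # Assumption: the entire text must match the Author Pattern.
    return AUTHOR_PATTERN.fullmatch(text) is not None


def is_numeric_noise(sentence):
    # More than 10 digits and fewer than 100 characters.
    digits = sum(ch.isdigit() for ch in sentence)
    return digits > 10 and len(sentence) < 100


def split_sentences(paragraph):
    # Hypothetical naive splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", paragraph)


def clean_document(paragraphs):
    """Return the cleaned document as a list of paragraphs (lists of sentences)."""
    cleaned = []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph or paragraph.startswith(COPYRIGHT_PREFIX):
            continue
        sentences = split_sentences(paragraph)
        # Assumption: the tag is replaced by one space, not deleted outright.
        sentences[0] = sentences[0].replace(REUTER_TAG, " ", 1)
        if is_author_line(" ".join(sentences)):
            continue
        kept = []
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence or sentence.startswith(COPYRIGHT_PREFIX):
                continue
            if is_author_line(sentence) or is_numeric_noise(sentence):
                continue
            kept.append(sentence)
        if kept:
            cleaned.append(kept)
    return cleaned
```

Paragraphs that end up with no remaining sentences are dropped entirely in this sketch; whether the original preprocessing does the same is not stated.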

Download

The split, together with the IDs of the documents used for training and testing, is available in the download section of the project.
