TextSegFault - Brown Corpus Evaluation

Introduction

This page describes the steps needed to reproduce the results of the Brown corpus evaluation.

Note: The current open-source version of TextSegFault does not implement all extensions described in the paper (no dispersion-based stop-word filtering, no dynamic block size, and no top segment selection). The results are therefore slightly worse than those reported in the paper:

Default	0.11	0.10	0.10	0.13
Number of segments known	0.11	0.09	0.08	0.06

Prepare

Download and extract version 1.2 of the C99 algorithm from its author's homepage.

Add TextSegFault

Create a shell script named TextSegFault in the bin directory. It should contain the following commands:

#!/bin/bash

java -cp "$PATH_TO_TEXTSEGFAULT_JAR" net.sourceforge.textsegfault.TextSegFault "$@"
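
The wrapper forwards its command-line arguments to Java, so how those arguments are expanded matters once a file name or option value contains spaces. The following is a minimal sketch of the difference between an unquoted and a quoted expansion; the function names count_args, unquoted and quoted are purely illustrative and are not part of TextSegFault or the C99 package.

```shell
#!/bin/bash

# Print how many arguments were received.
count_args() { echo "$#"; }

# Unquoted expansion: each argument is split again on whitespace.
unquoted() { count_args $@; }

# Quoted expansion: each argument is passed on exactly as received.
quoted() { count_args "$@"; }

unquoted "my file.txt"   # prints 2
quoted "my file.txt"     # prints 1
```

With the quoted form, an argument such as a path containing a space reaches the Java program as a single argument.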

Modify scripts

Add the following lines to ebin/public.batch to run TextSegFault:

public.testAlgorithm $DATASET "43" "TextSegFault" 
public.testAlgorithm $DATASET "44" "TextSegFault -n 10"

Add the bin directory to both PATH and CLASSPATH, then remove all old results with find ../data -name 'TestLog*.txt' -exec rm {} \;. If you get errors in the file private.testOneCase, try adding /usr/bin/ in front of the time command.
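
The environment setup and cleanup described above could be scripted roughly as follows. This is a sketch under assumptions: the variable C99_HOME and the function clean_old_results are hypothetical names introduced here, and the layout (bin and data directly below the extracted package) is assumed from the relative path ../data used above.

```shell
#!/bin/bash

# Hypothetical variable: root of the extracted C99 package.
# Defaults to the current directory; adjust to your setup.
C99_HOME="${C99_HOME:-$PWD}"

# Make the scripts in bin (including the TextSegFault wrapper) callable by name.
export PATH="$C99_HOME/bin:$PATH"
# Append any existing CLASSPATH only if it is set and non-empty.
export CLASSPATH="$C99_HOME/bin${CLASSPATH:+:$CLASSPATH}"

# Remove result logs of earlier runs below the given directory.
# Does nothing if the directory does not exist.
clean_old_results() {
    local data_dir="$1"
    [ -d "$data_dir" ] || return 0
    # The pattern is quoted so that find, not the shell, expands it.
    find "$data_dir" -name 'TestLog*.txt' -exec rm -f {} +
}

clean_old_results "$C99_HOME/data"
```

Quoting the pattern prevents the shell from expanding TestLog*.txt against the current directory before find sees it, and the {} placeholder tells find where to put each matching file name.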

Run the tests

Run the tests via public.BatchAll and create the summary with public.SummaryAll.
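
Because the batch scripts call TextSegFault by name, a missing PATH entry only shows up as an error partway through the run. The sketch below checks up front that the required commands can be found before starting; run_evaluation is a hypothetical helper name, and only the command names mentioned on this page are used.

```shell
#!/bin/bash

# Check that all required commands are on PATH, then run the batch
# followed by the summary. Returns non-zero if a command is missing.
run_evaluation() {
    local cmd
    for cmd in TextSegFault public.BatchAll public.SummaryAll; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command on PATH: $cmd" >&2
            return 1
        fi
    done
    # Only create the summary if the batch run succeeded.
    public.BatchAll && public.SummaryAll
}

# Usage: call run_evaluation from the directory in which you normally
# start the C99 scripts.
```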
