AG's News Topic Classification Dataset

Version 3, Updated 09/09/2015


ORIGIN

AG is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).


DESCRIPTION

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain all the training and testing samples, respectively, as comma-separated values. Each file has 3 columns: class index (1 to 4), title, and description. The title and description are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
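
The quoting rules described above can be handled with a standard CSV parser. The following is a minimal sketch in Python, not part of the original distribution: the function name read_ag_news and the in-memory sample row are hypothetical and made up for illustration. It assumes only what is stated above: three columns, double-quoted text fields, doubled internal quotes, and a literal backslash followed by "n" standing for a new line.

```python
import csv
import io


def read_ag_news(fileobj):
    """Yield (class_index, title, description) tuples from a CSV stream.

    Hypothetical helper: assumes the 3-column layout described in this README.
    """
    # doublequote=True makes the parser turn "" inside a quoted field into ".
    reader = csv.reader(fileobj, delimiter=",", quotechar='"', doublequote=True)
    for row in reader:
        if not row:  # skip blank lines, if any
            continue
        label, title, description = row
        # Turn the two-character sequence backslash + "n" into a real newline.
        title = title.replace("\\n", "\n")
        description = description.replace("\\n", "\n")
        yield int(label), title, description


# Made-up sample row (not taken from the dataset) exercising both escape rules.
sample = io.StringIO('3,"Title with ""quotes""","First line\\nSecond line"\n')
rows = list(read_ag_news(sample))
```

To read the real files, the same function could be given a file object, for example open("train.csv", newline="", encoding="utf-8"). The encoding is an assumption, since this README does not state one. The parser accepts the class index whether or not it is quoted.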