Archival Data and Text Analytics to Track 19th Century Late Blight.

Jean Ristaino: NC State University

<div><em>Phytophthora infestans</em> first caused disease in the US in 1843 around the ports of Philadelphia and New York. Ireland’s potato crop was destroyed in 1845. The potato blight caused devastation for many years and led to mass starvation and emigration from the country. There are several theories about the origin of the disease and the source of the 19th-century outbreaks. We used historical documents from the contemporary literature to investigate spatial information about disease outbreaks, pathogen origin and spread, and methods of management. The methodologies for automatically extracting information from these voluminous data sources will be discussed. Geographic locations that appear in the text close to key terms related to potato blight were identified and mapped. Data sources include agricultural documents with extensive discussions of crop yields and crop failures, seed tuber exports and imports, and weather conditions, along with location names. Natural language processing tools were applied to automate text mining of the data within narrative passages. Specifically, we used text analytics tools from the Natural Language Toolkit (NLTK) and location name extraction and disambiguation from Berico’s CLAVIN geoparser. NLTK and CLAVIN were coupled to mine the relationships between locations and reports of potato disease. Interestingly, the maps of US 19<sup>th</sup>-century late blight and modern 21<sup>st</sup>-century disease are strikingly similar. An interactive web mapping tool was developed for users to spatially explore the pertinent data for trends in the emergence of 19th-century late blight. New insights from archival data may change our view on the source and spread of this devastating disease.</div>
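
The proximity step described above (finding place names that occur near blight-related terms) can be illustrated with a minimal sketch. This is not the authors' pipeline: it replaces NLTK tokenization with a simple regular expression and replaces the CLAVIN geoparser with a tiny hard-coded gazetteer. The term list, the gazetteer, the function names, and the window size are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical key terms and place names, for illustration only.
# A real pipeline would rely on a geoparser for place recognition
# and disambiguation rather than a fixed set of strings.
BLIGHT_TERMS = {"blight", "rot", "murrain"}
GAZETTEER = {"philadelphia", "new york", "ireland", "boston"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def places_near_terms(text, window=10):
    """Count gazetteer places found within `window` tokens of a key term."""
    tokens = tokenize(text)
    term_positions = [i for i, tok in enumerate(tokens) if tok in BLIGHT_TERMS]
    hits = {}
    for i in range(len(tokens)):
        # Try two-word place names first, then single words.
        for n in (2, 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                if any(abs(i - p) <= window for p in term_positions):
                    hits[candidate] = hits.get(candidate, 0) + 1
                break
    return hits


if __name__ == "__main__":
    sample = (
        "The potato rot appeared near Philadelphia and New York in 1843. "
        "Wheat yields in Boston were good."
    )
    print(places_near_terms(sample, window=5))
```

In this sketch a place is counted only when a key term falls inside the token window, so a location mentioned in an unrelated passage (Boston in the sample) is left out. The resulting counts per place are the kind of output that could then be geocoded and mapped.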