I am looking for a developer to write a program with the following functionality.
Functionality
Parse files and extract sentences from Word documents and PDFs.
Parse a specified webpage and extract sentences and words. A sentence would be anything longer than x words.
Sentences need to be extracted. An option to extract phrases should be available as well.
The program needs to handle English, Chinese, Spanish, and Japanese (all UTF-8 encoded), and other languages as well.
Two or more languages could exist within the same document, but each character should be recognized.
If possible, the program should be written in a scripting language that would run on a web server, Mac, or PC (feel free to propose what you believe is the best approach).
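
To make the sentence rules above concrete, here is a rough Python sketch of the extraction step once the text has already been pulled out of a Word file, PDF, or webpage. Python is only one possible choice, and the names extract_sentences, count_words, and min_words are hypothetical. The sketch assumes that sentences end at . ! ? or their full-width forms (。！？), and that each Chinese or Japanese character counts as one word, since those scripts do not separate words with spaces. A real implementation would likely need a proper tokenizer and would not handle abbreviations such as "Dr." correctly.

```python
import re

# Assumption: a sentence ends at ASCII . ! ? followed by whitespace,
# or at the full-width marks used in Chinese and Japanese (no space needed).
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+|(?<=[。！？])\s*')

# Assumption: each kana/kanji character is one "word"; any other run of
# letters or digits is one word.
_TOKEN = re.compile(
    r'[\u3040-\u30ff\u4e00-\u9fff]|[^\W\u3040-\u30ff\u4e00-\u9fff]+'
)


def count_words(sentence):
    """Count words in a sentence that may mix several languages."""
    return len(_TOKEN.findall(sentence))


def extract_sentences(text, min_words):
    """Return the sentences in text that have more than min_words words."""
    pieces = (p.strip() for p in _SENTENCE_END.split(text))
    return [p for p in pieces if p and count_words(p) > min_words]


if __name__ == "__main__":
    sample = "Hello there my friend. Hi! 今日は良い天気です。 ¿Cómo estás hoy, amigo?"
    for sentence in extract_sentences(sample, min_words=3):
        print(sentence)
```

Here min_words plays the role of x: "Hi!" is dropped because it has only one word, while the English, Japanese, and Spanish sentences are kept. The phrase option is not covered by this sketch.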
Data Processing
The target file would take the name of the file that was read, with the extension changed to .txt.
The target file would contain one sentence per line.
The target .txt file should be written to a specified path on a local or remote server (possibly via FTP).
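
As an illustration of the output rules above, the following Python sketch derives the target file name, writes one sentence per line as UTF-8, and optionally uploads the result over FTP using the standard ftplib module. The function names (target_path, write_sentences, upload_via_ftp) and their parameters are hypothetical, and the FTP host, credentials, and remote directory would be supplied by whoever runs the program. Treat this as one possible approach under those assumptions, not a required design.

```python
from ftplib import FTP
from pathlib import Path


def target_path(source_name, out_dir):
    """Same base name as the source file, with the extension changed to .txt."""
    return Path(out_dir) / Path(source_name).with_suffix(".txt").name


def write_sentences(sentences, source_name, out_dir):
    """Write one sentence per line (UTF-8) and return the target path."""
    path = target_path(source_name, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for sentence in sentences:
            # Collapse internal line breaks so each sentence stays on one line.
            f.write(" ".join(sentence.split()) + "\n")
    return path


def upload_via_ftp(local_path, host, user, password, remote_dir="."):
    """Assumption: plain FTP with a username and password is acceptable."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        with open(local_path, "rb") as f:
            ftp.storbinary(f"STOR {Path(local_path).name}", f)
```

For example, write_sentences(sentences, "report.pdf", "/data/out") would create /data/out/report.txt, which could then be passed to upload_via_ftp if the target is a remote server.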