Introduction
Textanz is a text analysis tool that calculates the frequencies of
repeated phrases and wordforms in a document. The information obtained
from Textanz is useful for:
- checking your writing for overused terms and filler words
("parasite words"). This is very important in business
documentation and correspondence, translations and literature, and
is just as useful for personal writing, diaries etc.
- analyzing someone else's text: finding the author's favorite words
and constructions and evaluating the writing style.
- finding cross-quoting in two or more texts.
- calculating word density in an HTML page and choosing keywords for
META tags (for webmasters, promoters etc.).
- professional linguistic research.
Textanz is not a text editor and does not intend to compete against the
army of excellent, well-known editors created for various specific
formats. Instead, Textanz attempts to recognise known document
types and extract plain text from them for analysis. This extraction
job is delegated to the Apache Tika suite of parsers. The Tika project
page contains a link to the list of supported formats, so you can always
check whether a particular exotic format can be understood by Textanz.
Of course, all popular formats are supported: HTML, XML, RTF, PDF, MS
Office, OpenOffice.
Installation
Once you have downloaded textanz.zip, follow these steps:
- Make sure that you have Java Runtime 1.6+ installed on your PC. You
can check this online at
http://www.java.com/en/download/installed.jsp . If it is not installed
or is older than version 1.6, follow the online guide to download and
install the latest version.
- Unpack textanz.zip to any empty directory on a local drive.
- Windows users: go to bin\windows and run textanz.exe to
check your installation. If Textanz is unable to find a Java Runtime on
your system, the java.com download page will open. If the security
settings on your system block this operation, do it manually: open
http://www.java.com/download in your browser, follow the instructions
on that page to install Java, then restart textanz.exe.
- Unix users: go to bin\unix and edit gui.sh: set the TEXTANZ_HOME path
to the directory where Textanz was unpacked. Test your setup by running
gui.sh.
If everything is correct, the Textanz main window should open.
Terms and principles
Word.
Textanz does not use dictionaries or language-specific rules. Any
sequence of alphanumeric characters is treated as a word. Examples
of words:
book
WinXP
2011
12Ae8N$$$
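The rule above can be illustrated with a short Python sketch. This is not the Textanz implementation: the function name split_words and the extra_chars parameter are hypothetical, and the default extra characters (apostrophe and dollar sign) are an assumption made only so that the examples in this manual come out as single words.

```python
import re

def split_words(text, extra_chars="'$"):
    """Return the words in text: runs of letters, digits and extra_chars.

    extra_chars is a hypothetical stand-in for characters that should
    count as valid parts of a word in addition to letters and digits.
    """
    # [^\W_] matches any Unicode letter or digit (\w without the underscore).
    allowed = r"[^\W_]"
    if extra_chars:
        allowed += "|[" + re.escape(extra_chars) + "]"
    return re.findall("(?:" + allowed + ")+", text)

print(split_words("book WinXP 2011 12Ae8N$$$"))
```

No dictionary is involved: anything outside the allowed character set simply ends the current word.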
Phrase.
A phrase is any combination of N words (N > 0) that does not contain
phrase delimiters. Delimiters are the characters that normally
terminate a sentence: period, question mark, exclamation mark. All
other word delimiters (semicolon, comma etc.) are simply ignored.
Different forms of spacing and line breaks make no difference either.
The only thing that matters to Textanz is the sequence of words.
Example of two equal phrases:
Don't worry be happy.
Don't worry,
be
happy!
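As a rough sketch of why the two examples above count as the same phrase, the following Python code reduces each text to its sequence of words. The function name phrases and the terminators parameter are hypothetical, and the word pattern (letters, digits, underscore and apostrophe) is a simplifying assumption rather than the exact Textanz rule.

```python
import re

def phrases(text, terminators=".?!"):
    """Split text at phrase terminators and return each phrase as a tuple of words."""
    result = []
    for part in re.split("[" + re.escape(terminators) + "]", text):
        # Commas, semicolons, spaces and line breaks are not words,
        # so they disappear here and only the word sequence remains.
        words = re.findall(r"[\w']+", part)
        if words:
            result.append(tuple(words))
    return result

first = phrases("Don't worry be happy.")
second = phrases("Don't worry,\nbe\nhappy!")
print(first == second)
```

Comparing tuples of words makes punctuation, spacing and line breaks irrelevant by construction.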
Wordform.
Since Textanz does not know the morphology of any language, any
continuous part of a word is considered a wordform. The text can be in
a natural language, a fantasy language, a programming language, or
just a set of digits.
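To make the definition concrete, here is a minimal Python sketch that lists every continuous part of a word. The function name wordforms and the min_length parameter are hypothetical and only illustrate the idea.

```python
def wordforms(word, min_length=1):
    """Return every continuous substring of word that has at least min_length characters."""
    return [
        word[start:end]
        for start in range(len(word))
        for end in range(start + min_length, len(word) + 1)
    ]

print(wordforms("book", min_length=3))
```

Because no morphology is applied, "boo" and "ook" are wordforms of "book" just as much as "book" itself.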
Frequency.
Frequency is the number of occurrences of a particular phrase, word or
wordform in the text. Textanz calculates all frequencies greater than
1, i.e. any fragment that occurs at least twice will be found.
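The following Python sketch shows one simple way such frequencies could be counted for phrases. It assumes that a phrase is a run of consecutive words inside one sentence, which is an interpretation of the definitions above and not a statement about how Textanz works internally. The names phrase_frequencies, min_frequency and terminators are hypothetical.

```python
import re
from collections import Counter

def phrase_frequencies(text, min_frequency=2, terminators=".?!"):
    """Count every run of consecutive words within a sentence.

    Only phrases that occur at least min_frequency times are returned.
    """
    counts = Counter()
    for sentence in re.split("[" + re.escape(terminators) + "]", text):
        words = re.findall(r"[\w']+", sentence)
        # Enumerate all word runs words[start:end] of the sentence.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                counts[" ".join(words[start:end])] += 1
    return {phrase: n for phrase, n in counts.items() if n >= min_frequency}

print(phrase_frequencies("to be or not to be. To be is to do."))
```

This brute-force enumeration is quadratic in sentence length and is meant only to show what is being counted. The comparison here is case-sensitive, so "To be" and "to be" are counted separately.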
Loading the text
Use the "Text" menu or the toolbar buttons to load text for analysis. Textanz offers 3 ways of loading:
- "Open file" dialog. This is the standard way in all applications:
just use the dialog to locate a file on a local disk or on the network.
- "Open URL" dialog. You will be prompted to enter the web address of
the HTML (PDF, DOC, XML etc.) page to download and analyse.
- "Paste from clipboard". If the system clipboard contains any text, it
will be copied to Textanz.
Please note that extraction from a complex document may cause some delay
before you see the text in Textanz. Additional time is required to
download a document from a remote URL. The text appears in the right
pane. Whenever this tab is not empty, the "Calculate" menu items are
enabled and the text can be analysed for frequencies.
You can open multiple documents in Textanz. Each document is opened in a separate tab.
Calculating frequencies and seeing results
Use the "Calculate" menu or the toolbar buttons to start the frequency
calculation for phrases or wordforms. Textanz will
display a progress indicator during the calculation, and the "Cancel"
action is available to interrupt the process. Once the calculation is
finished, the program populates the frequency table with the results.
If multiple text tabs are open, Textanz calculates frequencies across all loaded documents.
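A calculation across several documents can be pictured as adding up the per-document counts, so a word that appears once in each of two texts ends up with frequency 2. The Python sketch below illustrates this for single words only. It is an assumption about the intended behaviour, and the names word_counts and combined_frequencies are hypothetical.

```python
import re
from collections import Counter

def word_counts(text):
    """Count the words of one document (simplified word pattern)."""
    return Counter(re.findall(r"[\w']+", text))

def combined_frequencies(documents, min_frequency=2):
    """Sum the word counts of all documents and keep the repeated words."""
    total = Counter()
    for text in documents:
        total.update(word_counts(text))
    return {word: n for word, n in total.items() if n >= min_frequency}

print(combined_frequencies(["alpha beta", "beta gamma"]))
```

This is what makes cross-quoting visible: fragments shared between texts reach the minimum frequency even if each text uses them only once.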
The phrase frequencies table contains the following columns:
"Phrase" - the phrase itself
"Frequency" - number of occurrences
"Length" - number of words in the phrase
The wordform frequencies table contains the following columns:
"Wordform" - the wordform itself
"Frequency" - number of occurrences
"Length" - number of characters in the wordform
By default, phrase frequency records are sorted by phrase
length, then by frequency. Wordforms are sorted by frequency, then by
length. The column titles in both tables act as buttons that reorder
the records in ascending or descending order by the corresponding column.
The filter box above the table lets you limit the displayed phrases or
wordforms to only those containing the typed string.
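The sorting and filtering described above can be sketched in Python as follows. Rows are modelled as (text, frequency, length) tuples. The function names are hypothetical, and the descending default is an assumption, since this manual does not state the direction of the default order.

```python
def sort_phrases(rows, descending=True):
    """Order (phrase, frequency, length) rows by length, then by frequency."""
    return sorted(rows, key=lambda row: (row[2], row[1]), reverse=descending)

def sort_wordforms(rows, descending=True):
    """Order (wordform, frequency, length) rows by frequency, then by length."""
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=descending)

def filter_rows(rows, typed):
    """Keep only the rows whose text contains the typed string."""
    return [row for row in rows if typed in row[0]]

rows = [("be", 5, 1), ("be happy", 2, 2), ("don't worry", 3, 2)]
print(sort_phrases(rows))
print(filter_rows(rows, "be"))
```

Clicking a column title corresponds to sorting by that single column and toggling the descending flag.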
To see the occurrences of a phrase or word in the source text,
select its table row and use the "Highlight positions" menu item or
toolbar button, or double-click the row. You can select multiple rows
by holding the CTRL or SHIFT key while clicking, and then highlight
the positions of all of them in the text. The navigation markers on the
left edge of the pane work as hyperlinks that scroll the corresponding
position into view. Please note that if the calculation was made for
multiple texts, a particular text tab may contain no occurrences of
some word or phrase. In that case, the "NO OCCURRENCES" message is
displayed.
The "Highlight positions" and "Clear highlighting" commands apply to the
selected text tab only. These functions are not available if the text
was loaded into the tab after the calculation (a new
calculation is required to add results for this text).
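Locating the occurrences to highlight can be sketched as a search that tolerates any non-terminating punctuation or spacing between the words of a phrase. The Python code below is only an illustration. The function name find_positions is hypothetical, and the terminator set (period, question mark, exclamation mark) is taken from the phrase definition earlier in this manual.

```python
import re

def find_positions(text, phrase):
    """Return (start, end) character offsets of every occurrence of phrase in text."""
    words = phrase.split()
    if not words:
        return []
    # Between two words allow any run of characters that are neither
    # word characters nor phrase terminators (commas, spaces, line breaks).
    gap = r"[^\w.?!]+"
    pattern = r"(?<!\w)" + gap.join(re.escape(w) for w in words) + r"(?!\w)"
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

text = "Don't worry be happy. Don't worry,\nbe\nhappy!"
print(find_positions(text, "worry be"))
```

An empty result corresponds to the "NO OCCURRENCES" case for a tab that does not contain the selected phrase.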
Exporting calculation results
Textanz offers 3 forms of export to external files:
- Tag cloud (HTML): exports the list of words and phrases to an HTML
file. The sizes of the elements in the resulting file are proportional
to their frequencies. This is a well-known way of presenting
frequencies on the Web.
- Text (CSV): exports the results to a comma-separated text file. The
CSV format is supported by MS Office and can be opened as an MS Excel
sheet.
- XML: exports the results to well-formed XML that can be parsed or transformed with third-party software of your choice.
Invoking any export action opens the standard "Save as" dialog with
the corresponding default file extension. A dialog message confirms
the successful export upon completion.
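To give an idea of what the three export forms might look like, here is a Python sketch that renders the same rows as CSV, XML and a simple tag cloud. The actual file layouts produced by Textanz are not documented here, so the XML element and attribute names, the HTML markup and the base_px scaling are all assumptions, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html import escape

def to_csv(rows):
    """Render (phrase, frequency, length) rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Phrase", "Frequency", "Length"])
    writer.writerows(rows)
    return buffer.getvalue()

def to_xml(rows):
    """Render rows as well-formed XML (element names are assumed)."""
    root = ET.Element("frequencies")
    for phrase, frequency, length in rows:
        item = ET.SubElement(root, "phrase",
                             frequency=str(frequency), length=str(length))
        item.text = phrase
    return ET.tostring(root, encoding="unicode")

def to_tag_cloud(rows, base_px=10):
    """Render rows as HTML with font sizes proportional to frequency."""
    spans = [
        '<span style="font-size:%dpx">%s</span>' % (base_px * frequency, escape(phrase))
        for phrase, frequency, _length in rows
    ]
    return "<html><body>\n" + "\n".join(spans) + "\n</body></html>"

rows = [("be happy", 2, 2), ("worry, be", 3, 2)]
print(to_csv(rows))
```

The csv module quotes any field that contains a comma, which keeps such phrases in a single column when the file is opened as a spreadsheet.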
Configuration
The Configuration dialog is available via the "Calculation".."Configuration"
menu item or toolbar button. It contains the following tabs:
General: configuration parameters for all types of calculation:
- Case-sensitive: whether Textanz distinguishes between upper and lower case.
- Non-alphabetical word characters: characters, other than letters, to be treated as valid parts of words.
Phrases: parameters specific to phrase frequency:
- Min.phrase frequency: the minimum frequency that Textanz should respect and show in the results.
- Min.phrase length: the minimum number of words in a phrase that Textanz should respect and show in the results.
- Phrase terminator characters: characters that terminate
a phrase (sentence) in the text, normally the period, question mark and
exclamation mark.
Wordforms: parameters specific to wordform frequency:
- Min.wordform frequency: the minimum frequency that Textanz should respect and show in the results.
- Min.wordform length: the minimum number of characters in a substring that Textanz should respect and show in the results.
Language: parameters specific to the text language:
- Use language settings: if the checkbox is not set, all other
parameters in this tab are ignored. If the checkbox is set, the language
should be selected from the dropdown list next to the checkbox. Textanz
does not autodetect the language of loaded texts.
- Ignore common words: instructs Textanz to ignore commonly used
individual words (prepositions, pronouns, articles etc.) and not
display their frequencies.
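The effect of "Ignore common words" can be pictured with the Python sketch below. The word list is a tiny illustrative sample and not the list shipped with Textanz, and the names COMMON_WORDS and drop_common_words are hypothetical. Passing no language models the case where "Use language settings" is not checked.

```python
# Illustrative sample only; a real list would be much longer.
COMMON_WORDS = {
    "en": {"a", "an", "the", "of", "in", "to", "and", "it", "is"},
}

def drop_common_words(frequencies, language=None):
    """Remove single common words from a {text: frequency} mapping.

    With language=None nothing is removed, mirroring an unchecked
    "Use language settings" box. Multi-word phrases are always kept.
    """
    if language is None:
        return dict(frequencies)
    common = COMMON_WORDS.get(language, set())
    return {
        text: n for text, n in frequencies.items()
        if text.lower() not in common
    }

freqs = {"the": 10, "The": 2, "book": 3, "of the": 4}
print(drop_common_words(freqs, "en"))
```

Only entries that consist of one common word are dropped, so a phrase such as "of the" still appears in the results.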
The following buttons can be used in the Configuration dialog:
Apply - the settings are used
in the current session but not saved. The next time you launch Textanz,
the previous configuration is restored.
Save - the settings are saved in the configuration files and used in all subsequent sessions.
Close - closes the configuration dialog.
Textanz saves its configuration in the /conf subdirectory of TEXTANZ_HOME, or
in the user home directory of your operating system if TEXTANZ_HOME is
not set.
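The location rule above can be expressed as a short Python sketch. It only illustrates the lookup order described in this section. The function name config_dir is hypothetical, and the names of the configuration files themselves are not covered.

```python
import os
from pathlib import Path

def config_dir(env=None):
    """Return the directory where the configuration would be stored.

    env defaults to the process environment; it is a parameter only so
    that the rule can be tried out with a plain dictionary.
    """
    env = os.environ if env is None else env
    home = env.get("TEXTANZ_HOME")
    if home:
        # TEXTANZ_HOME is set: use its conf subdirectory.
        return Path(home) / "conf"
    # TEXTANZ_HOME is not set: fall back to the user home directory.
    return Path.home()

print(config_dir({"TEXTANZ_HOME": "/opt/textanz"}))
```

If saved settings seem to be missing after a reinstall, checking which of the two locations applies is a reasonable first step.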
Registration
Textanz is distributed as trialware. This means that after
the trial period of 30 days, the user must
purchase a license to continue using the tool for either personal or
professional purposes. We trust that you are interested in the further
evolution of Textanz and respect the work already done.
Registered users get future versions of Textanz at no additional
charge. The "Check for updates" command opens a web page with the Textanz
change log. You are always welcome to send comments, suggestions
and update requests to
info@textanz.com.