By Josh Froelich
You are an employee at a venture capital company. The company recently received
a proposal to fund a project aimed at developing a new type of collaborative
database. You have been assigned to research how and why corporations are using
database technologies today. You need to find out whether the proposed project
is indeed innovative and justifies the investment. An Internet search for
“collaborative databases” returns 7,519 documents.
The first source you uncover is about eleven pages long. You want to know
quickly whether this source contains relevant and useful information for your
project, but you do not have time to read the entire document just to answer
this question. This tutorial shows how TextAnalyst speeds up and simplifies the
analysis.
Article Background: Databasing in the 90’s, “Data and What We're Doing With It”,
by Jennifer Barrett, Acxiom Corporation.
1. From Windows, select Start Menu | Programs | TextAnalyst 2.0 | TextAnalyst
2. While TextAnalyst loads in the background, you are presented with the Startup
window, which offers three main actions:
1. Select the top hat icon to analyze new texts and create a knowledge base.
2. An Open file dialogue box appears, showing that you are looking in the
TextAnalyst Tutorials folder.
3. Double click on the Examples folder.
4. Double click on the file named “Databasing in the 90.txt”. This opens the
file in TextAnalyst.
Once the file is opened, TextAnalyst analyzes it. As TextAnalyst analyzes text,
it determines which concepts (words and word combinations) are most important in
the context of the investigated text. Each concept is labeled as a node and
assigned a numeric semantic weight, the measure of the probability that this
concept is important in the studied text. Simultaneously, TextAnalyst determines
the weights of the relations between individual concepts in the text and
hyperlinks concepts to those fragments (sentences) in the original text where
these concepts have been encountered. Node terms are placed in quotes within
this tutorial.
The resulting structure, called the Semantic Network, is a set of the most
significant concepts distilled from the analyzed texts, along with the semantic
relationships between these concepts in the text. The Semantic Network is a
cyclical graph holding all the most important information from the investigated
text in a very concise form. If we were to visualize the Semantic Network, it
would resemble a molecular structure: all atoms within a molecule are
interconnected either directly or through joint neighbors.
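To make the idea concrete, here is a minimal sketch of how such a network could be modeled, assuming nothing about TextAnalyst's internal format. The class name, the method names, and all of the weights are illustrative assumptions, not part of the product.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    """Toy model (hypothetical): weighted concepts plus weighted links."""
    # concept -> semantic weight of the concept in the whole text (0..100)
    weights: dict[str, int] = field(default_factory=dict)
    # concept -> {neighbor concept: weight of the relation between the two}
    links: dict[str, dict[str, int]] = field(default_factory=dict)

    def add_concept(self, name, weight):
        self.weights[name] = weight
        self.links.setdefault(name, {})

    def relate(self, a, b, weight):
        # Store the relation in both directions, so the graph may contain cycles.
        self.links[a][b] = weight
        self.links[b][a] = weight

    def children(self, name):
        """Return (relation weight, concept weight, concept), strongest relation first."""
        rows = [(rel, self.weights[other], other)
                for other, rel in self.links[name].items()]
        return sorted(rows, reverse=True)


# Illustrative values only; they are not read from the tutorial file.
net = SemanticNetwork()
net.add_concept("databases", 99)
net.add_concept("businesses", 99)
net.add_concept("sold", 80)
net.relate("databases", "businesses", 59)
net.relate("databases", "sold", 40)

for relation, weight, concept in net.children("databases"):
    print(relation, weight, concept)
```

Each printed row carries two numbers per related concept: the strength of the relation to the selected concept, followed by the weight of the concept in the whole text.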
Mathematical algorithms inside TextAnalyst determine the relative importance of
a text concept solely by analyzing its connections to other concepts in the
text. Therefore, TextAnalyst creates the semantic network without using
background knowledge of the subject. TextAnalyst implements algorithms similar
to those used for text analysis in the human brain.
When desired, the user may add background knowledge through an external
dictionary to fine-tune TextAnalyst to a particular subject.
The TextAnalyst window is divided into three main viewing sections, or panes.
The top left pane is called the view pane. Tabs let you switch between five
different views within the view pane: Document list, Topic structure, Semantic
network, Semantic search, and Search. By default, the view pane displays the
topic structure tree of the investigated text. The top right pane is the results
pane and is currently blank. The bottom pane, the text pane, contains the full
original text of “Databasing in the 90.txt”.
Step Four: Understanding the Topic Structure View, Results and Text Panes
You are trying to develop some ideas about the relationships between the
concepts that you are researching for your report. You want to know more about
the selling of databases. Instead of reading the entire text, you can interact
with the semantic network to discover more about the role of sales and the use
of databases in the 90’s.
Each node in the semantic tree in the view pane contains a concept. The number
to the left of the word (for example, 99) represents the semantic weight of this
concept, ranging from 0 to 100. The different fish icons give a rough visual
indication of the semantic weight of the concept. Initially, all nodes except
the root are closed. Double clicking on a parent node makes its child nodes
visible. These child nodes may also contain child nodes, like a family tree.
1. Double click on the black whale next to the node “database” in the view pane.
A tree structure forms under the node “database”.
The two numbers located next to each node under “databases” represent different
semantic weights. For example, look at the node “businesses”, which is preceded
by 59 99. The first number, 59, refers to the weight, or strength, of the
semantic relationship of the node “businesses” to the parent node “databases”.
The second number, 99, refers to the semantic weight of the word “businesses” in
the entire text.
2. Double click on the node “sold”. Click the node “<ALL>” under “sold”. Every
sentence containing the word “sold”, or similar words such as “sell” and
“selling”, will appear in red in the results pane. Sentences are listed in the
order in which they occur in the full text.
In TextAnalyst, whenever an important concept or word in the results pane is
also contained in the view pane’s tree structure, the word is colored red.
You become interested in the sentence “It is fair to say most consumers do not
realize the scope of information that is maintained on them, nor do they
understand the economics of what that data can do to reduce the costs of
developing and selling products and thus the ultimate cost of the product
itself.” You wish to better understand the sentence in its surrounding context
and to know where it is located in the full text.
3. Double click on the sentence listed above in the results pane.
Notice that the sentence becomes highlighted in the results pane, and that it is
also located and highlighted in the full text. You can now read the surrounding
paragraph and sentences to gain a better understanding.
There are several sentences that contain the word “sold” or an alternate form of
it. You wish to narrow your scope even more to find out specifically about the
selling of databases.
4. Single click the node “sold” in the tree structure directly above the “<ALL>”
node you clicked in step 2. Now the results pane shows only sentences containing
both the words “sold” and “databases”, or similar forms of those words.
Step Five: Diving Into the Subject Even More
You want to analyze the importance of sales in Databasing in the 90’s. You know
how to see a list of sentences containing the term “sales”, but you want to
focus your search further to learn how sales, by looking at the term “sold”, tie
into companies and databases.
1. In this case you are looking for the node “sold”, located under the top node
“databases”. TextAnalyst can tell that the term “sell” is similar to “sold”.
2. Under the node “sold”, double click the node “companies”. Notice the
sentences in the results pane.
The sentences are not specific enough for your search for “companies”, “sales”
and “databases”.
3. In the toolbar at the top of the program, locate the seventh icon from the
left, “Include all parents”.
4. If you hover over the icon, a tool tip appears saying “Include all parents”.
Click this icon. After you click the icon, it should remain in the pressed
state. You have activated “Include all parents”.
5. Notice that now there is only one sentence in the results pane that contains
all three words, “databases”, “sold”, and “company”.
You have effectively narrowed your scope.
6. Click the icon again to turn off “Include all parents”.
7. Double click on the whale next to the top node “databases” to return the
semantic network to its default closed view. You should see only the node
“databases”.
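The narrowing you just performed can be pictured as a filter over sentences. The following is a rough sketch under stated assumptions: it matches crude word prefixes rather than the word forms TextAnalyst recognizes, and the sample sentences, function names, and prefixes are all made up for illustration.

```python
import re

# Hypothetical sentences; they are not taken from the tutorial text.
SENTENCES = [
    "The company sold its databases to a partner.",
    "Several companies sold mailing lists last year.",
    "Databases are sold by subscription.",
]

# Path from the top node to the selected node, written as crude word prefixes.
PATH = ["databas", "sold", "compan"]


def contains_all(sentence, prefixes):
    """True if every prefix starts at least one word of the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return all(any(word.startswith(p) for word in words) for p in prefixes)


def filter_sentences(sentences, path, include_all_parents):
    # Without the option, only the selected node and its direct parent must match.
    required = path if include_all_parents else path[-2:]
    return [s for s in sentences if contains_all(s, required)]


print(filter_sentences(SENTENCES, PATH, include_all_parents=False))
print(filter_sentences(SENTENCES, PATH, include_all_parents=True))
```

With the option off, two of the sample sentences match; with it on, only the sentence mentioning all three concepts remains.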
Step Six: Using the Dictionary
In your report you are specifically interested in certain keywords, but you are
unsure whether TextAnalyst will retrieve them from the text because of their
possibly low semantic weight. You want to edit some of the words that
TextAnalyst uses to build the semantic network, as described in Step Four of
this tutorial. You want to add the word “personalized” because your report
examines personalized databases more closely.
3. From the main file menu in TextAnalyst, select Settings | Edit Dictionaries.
4. This starts the VocEdit application, a dictionary program that TextAnalyst
uses in certain circumstances. This is the only language-dependent area of
TextAnalyst.
5. Right click anywhere in the left window of VocEdit. Make sure not to right
click on a word.
6. A small menu appears with the words Add and Find in bold.
7. Select Add. An entry with the text New Entry is highlighted. Before clicking
anything else, type “personalized” without the quotes. Press the Enter key. The
word “personalized” is added to the dictionary.
8. Right click on “personalized”. A small menu appears. Select user word. This
tells TextAnalyst that this is an important word.
You have successfully added “personalized” to the dictionary.
9. Click Exit in the lower right corner of VocEdit.
10. A dialog box appears asking you to save your changes. Click Yes.
11. A dialog box appears asking if you want to replace the current dictionary
file with the new file, or save the new file under a different name. Click No to
save the file under a different name.
12. A Save as dialog box appears. You should be in the TextAnalyst folder. You
are provided with a default name of TextAnalyst 2.dic.
13. Name the file “mydictionary.dic”.
14. Click Save.
15. VocEdit saves the file and closes. Return to TextAnalyst.
The next step is to link the new dictionary to TextAnalyst by telling
TextAnalyst to use your dictionary in place of the default dictionary.
16. From the main file menu, select Settings | General settings.
17. A dialog box appears titled General settings. Select the Analysis tab near
the top of the dialog box.
18. Locate the field labeled Dictionary: on the tab.
19. The current dictionary is the default dictionary. Click on the button to the
right of the current dictionary.
20. An Open dialog box appears. Locate and select the new dictionary file titled
mydictionary.dic.
21. Click Open. Return to the TextAnalyst program.
22. Click OK to apply the new settings. The General settings dialog box
disappears. TextAnalyst applies the new dictionary.
23. At the bottom of the view pane, locate the third tab from the left. Notice
that the word “personalized” is located at the top of the new semantic network
in bold. See Using the Semantic Network to work with your new findings.
Step Seven: Using and Understanding Summary Analysis, Changing the Threshold
You have analyzed a few relationships so far and better understand TextAnalyst’s
look and feel. You want to compose an overview of the entire text you are
analyzing, not just bits and pieces. Your boss wants a short introductory
summary of some of the sources for the report.
TextAnalyst can create summaries of different lengths from full texts.
1. From the main file menu, click on Analysis | Summarization.
2. TextAnalyst performs the summarization, which is now displayed in the results
pane.
Notice that the view pane is no longer in its semantic network view. It now
displays some statistics about the summary it performed. The percentage shown
next to the summary indicates that the summary is about 14% the size of the
entire document. TextAnalyst enables you to summarize the entire document to a
fraction of its size while still retaining significant meaning in the summary.
During summarization, TextAnalyst determines the semantic weight of each
sentence and displays in the results pane only sentences whose semantic weight
reaches the threshold. The default threshold is 90. Currently all sentences with
a semantic weight of 90 and higher appear in the results pane.
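The threshold rule amounts to a simple filter. The sketch below assumes each sentence already carries a weight from 0 to 100; the sample weights and the function name are hypothetical, and the way TextAnalyst computes sentence weights is not shown here.

```python
def summarize(weighted_sentences, threshold=90):
    """Keep sentences whose weight is at or above the threshold, in text order."""
    return [text for weight, text in weighted_sentences if weight >= threshold]


# Made-up (weight, sentence) pairs in the order they occur in a text.
SAMPLE = [
    (100, "First key sentence."),
    (72, "A minor aside."),
    (95, "Second key sentence."),
    (90, "A borderline sentence."),
]

default_summary = summarize(SAMPLE)      # default threshold of 90
short_summary = summarize(SAMPLE, 99)    # stricter threshold, shorter summary

print(len(default_summary), len(short_summary))
```

Raising the threshold can only remove sentences, which is why a higher threshold yields a shorter summary.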
The summary lists the most important sentences in the context of the original
text. The sentences are chosen on the basis of the concepts, and the
relationships between concepts, in the full text.
You like that this summary is only 14% the size of the entire text. However, you
want an even more concise summary.
TextAnalyst allows you to change the size of your summary by changing the
semantic weight threshold. The default, as mentioned, is 90, so any summary with
the default threshold includes all sentences with a semantic weight of 90 to
100, 100 being the maximum weight. By increasing the semantic threshold you can
decrease the size of the summary.
TextAnalyst also allows you to view the semantic weight of each sentence in the
results pane.
1. Click the Hammer icon in the toolbar.
2. A Settings menu appears.
3. To display weights next to sentences, check the box next to “Display semantic
weights of sentences.”
4. Click Apply. If you look at the results pane, you can now view the semantic
weight next to each sentence. After viewing the weights, return to the Settings
menu.
5. Adjust the semantic weight threshold by using the arrow buttons or typing the
number you wish to use. Change the number from 90 to 99. This means that only
sentences with a semantic weight of 99 or 100 are included, giving you a shorter
summary.
6. Click Apply. Click OK.
The summary is recalculated with the new threshold. Only the most important
sentences in the full text, those with a weight of 99 or 100, are used to create
the summary.
7. Uncheck the box in the Settings menu to hide the semantic weights.
8. NOTE: Now that you have performed a summary, the view pane has changed.
Notice the five little tabs with pictures on them at the bottom of the view
pane. After doing the summary, you are now looking at the summary tab of the
view pane. To change back to the tab you were using at the beginning of this
tutorial, click on the second tab from the left, the topic structure view tab.
See Understanding the Topic Structure Tree for more details about the tabs in
the view pane.
Step Eight: Using the Semantic Search
You have several questions you are trying to answer in your report. One of the
questions is: what companies are renting customer lists? You want to find
information in the text that answers that question without searching the entire
text.
TextAnalyst allows you to perform a semantic search on the full text.
1. From the main menu, select Search | Semantic Search. A Semantic search window
appears. Note that there is already a sentence in the query area. TextAnalyst
enters into the query box the text you highlighted in the full text in an
earlier step of this tutorial. This reduces the amount of typing: you can click
a sentence and then perform a search using that sentence.
2. Delete the current sentence, as it is not relevant to this step.
TextAnalyst accepts searches phrased as full sentences or questions. This type
of search is often called a Natural Language Query. You can type in your
question exactly as it is formed in your head and click Search, instead of
having to root out keywords or phrases. This greatly simplifies the search
process.
3. In the enter query area, type the following question:
What are the companies renting customer lists?
4. Click Search. TextAnalyst performs a semantic search.
5. The results pane shows the sentences from the original text that are most
relevant to your question. The results describe how companies are renting the
lists.
6. More importantly, the view pane now contains a topic-oriented tree structure
based on the question you typed in the semantic search. This sub-tree of
concepts related to the query in the context of the present text can help you if
your results do not answer your question. You can browse through the concepts
the same way you browsed the topic structure tree earlier in the tutorial. This
tree can help you formulate a better answer to your question, as it shows that
some words you might not have considered connected are very important for your
search.
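To give a feel for ranking sentences against a natural language question, here is a deliberately simplified sketch. It uses plain keyword overlap, which is a stand-in technique and not TextAnalyst's semantic search; the stop-word list, function names, and sample sentences are assumptions made for illustration.

```python
import re

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"what", "are", "the", "is", "a", "an", "of", "in", "to"}


def terms(text):
    """Lowercased words of the text, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def rank(query, sentences):
    """Sentences sharing at least one term with the query, best match first."""
    wanted = terms(query)
    scored = [(len(wanted & terms(s)), s) for s in sentences]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps text order on ties
    return [s for score, s in scored if score > 0]


# Made-up sentences; they are not taken from the tutorial text.
TEXT = [
    "Privacy laws vary by country.",
    "Customer data is stored in large databases.",
    "Many companies are renting customer lists to marketers.",
]

results = rank("What are the companies renting customer lists?", TEXT)
for sentence in results:
    print(sentence)
```

Keyword overlap only finds sentences that share words with the question, whereas the concept tree described in step 6 can also surface related words that you did not type.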
Step Nine: Outputting to HTML
You wish to share some of your findings from TextAnalyst with colleagues at a
company branch in London. TextAnalyst allows you to do this over the Internet by
exporting your results to an HTML knowledge base.
TextAnalyst can export results to a file in web format.
1. From the main menu, click File | Export to HTML …
2. A Save as dialogue box appears. You can save the file anywhere on your
computer, so long as you can remember its location. Save the file in the
Examples folder. It is named ExportHTML.html by default. Click OK.
3. View the file in a web browser such as Microsoft Internet Explorer or
Netscape Navigator. To do this:
4. Open a web browser. From the browser’s main menu, select File | Open. A
dialog box appears for locating the file.
5. Click Browse to find the file on your computer.
Default location: C:\Program Files\Megaputer Intelligence\MicroSystems\TextAnalyst 2.0\Examples\ExportHTML.html
In the HTML file, key concepts are hyperlinked. By clicking on a concept, you
can view the sentences in which the concept is found to be important. By
clicking on one of these sentences, you can then view the sentence and the
concept in the context of the full text. Clicking on the concept again returns
you to the topic list.
6. You now have a file that you can save to your company’s web server, ready to
be published so that your colleagues at the London branch can view it.
7. Close the browser and return to the TextAnalyst program.
Step Ten: Exporting to External Applications in a CSV file
Some of your coworkers are experts in Excel, and they wish to analyze some of
your findings to produce additional reports. You wish to export the data from
TextAnalyst so that other computer applications can work with it. Through this
export you can view a list of key words with their frequency and semantic
weight.
From the main file menu, select File | Export.
1. A Save as dialog box appears. Check that you are in the Examples folder. The
file has been named ExportBase.csv by default. CSV stands for comma separated
value and is a common file format used by many computer applications.
2. Click Save. The dialog box disappears. You have successfully exported the
file.
3. This tutorial uses Microsoft Excel as the example external application that
works with your exported data.
4. Open Microsoft Excel.
5. From the main file menu in Excel, select File | Open.
6. Locate the file ExportBase.csv. In this tutorial you saved the file in the
Examples folder, which by default is located at:
C:\Program Files\Megaputer Intelligence\MicroSystems\TextAnalyst\Examples\ExportBase.csv
Note: The Open file dialogue box should be set to open Files of Type: All Files
(*.*), or else the .csv file may not show up.
7. Excel will import the ExportBase.csv file.
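If a colleague prefers a script to a spreadsheet, the exported data could also be read programmatically. The sketch below assumes each row holds a key word, its frequency, and its semantic weight, in that order; the actual column layout of ExportBase.csv is not documented in this tutorial, so check the file before relying on it. The sample rows and function names are made up.

```python
import csv
import io

# Stand-in for the contents of ExportBase.csv; the column order is an assumption.
SAMPLE_CSV = "databases,42,99\nbusinesses,17,99\nsold,9,80\n"


def load_concepts(handle):
    """Read (key word, frequency, weight) rows from an open CSV file."""
    rows = []
    for record in csv.reader(handle):
        if len(record) < 3:
            continue  # skip blank or incomplete lines
        rows.append((record[0], int(record[1]), int(record[2])))
    return rows


def top_by_weight(rows, count):
    """Highest semantic weight first; frequency breaks ties."""
    return sorted(rows, key=lambda row: (-row[2], -row[1]))[:count]


concepts = load_concepts(io.StringIO(SAMPLE_CSV))
for word, frequency, weight in top_by_weight(concepts, 2):
    print(word, frequency, weight)
```

To read a real export, replace the `io.StringIO` object with a file opened via `open(path, newline="")`.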
Summary:
Through the use of TextAnalyst you were able to quickly tell that the document
“Databasing in the 90’s” is indeed a valuable resource. TextAnalyst’s concise
summary allowed you to grasp the key points of the text. The Topic Structure
tree led you to the most important concepts, from which you could focus your
investigation. By navigating the document you were able to retrieve sentences
that answered your questions. Exporting your findings as a web page and as a
spreadsheet file let you share the results of your work with remote colleagues.
By configuring the connected dictionary to better match your criteria, you
improved and focused TextAnalyst’s analysis.
With this powerful arsenal of tools, which lets you quickly comprehend the
meaning of a text without reading the full document, you can create detailed and
insightful reports.
Short quiz:
Q1. Is TextAnalyst a useful tool?
Q2. Is TextAnalyst user friendly?
We wish you happy projects!
© 2000 Megaputer Intelligence Inc. All rights reserved.
© 2000 MicroSystems, Ltd. All rights reserved.