Document
Document objects are files that are processed using Chinese Text Analyser’s segmentation engine. They provide the following functions:
Document( filename, options )
Returns a Document object for the file specified by filename. The options parameter is an optional table that can contain the following keys:
- process (boolean) - whether or not to process the file using Chinese Text Analyser before returning. Defaults to true.
Example
local cta = require 'cta'

-- Open a document and return after Chinese Text Analyser has completed processing
local document1 = cta.Document( 'file1.txt' )

-- Open a document and return without processing it.
-- Statistics and word lists will not be available unless you call
-- Document:process() or Document:startProcessing().
local document2 = cta.Document( 'file2.txt', { process = false } )
Normally you will want to access document statistics and word lists so it is preferable to let Chinese Text Analyser process the document.
If you are sure that your script does not need this information (perhaps you only want to search a document for some text, or only want to print out document sentences) then it will be slightly faster if you do not get Chinese Text Analyser to process the document first.
Document:hasFinishedProcessing()
Returns true if the document has finished processing and false otherwise.
If a document has finished processing, then statistics and word list information will be available for the document. See Document() for more information.
Example
local function hasFinished( document )
    if document:hasFinishedProcessing() then
        print( document:name() .. ' has finished processing' )
    else
        print( document:name() .. ' has not finished processing' )
    end
end

local cta = require 'cta'
local document1 = cta.Document( 'file1.txt' )
local document2 = cta.Document( 'file2.txt', { process = false } )

hasFinished( document1 )
hasFinished( document2 )
Output
file1.txt has finished processing
file2.txt has not finished processing
Document:process()
Starts processing a document and waits until processing has finished before returning. If the document has already been processed this function does nothing.
You don’t normally need to call this function unless you created the document with the process parameter set to false. See Document() for more information.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:process()
Document:startProcessing()
Starts processing a document and returns immediately even if the document hasn’t finished processing. If the document has already been processed this function does nothing.
This function is useful if you want to process a large file, but there is other work you can do in your script first before you need access to the document statistics or word lists. All document processing occurs in a background thread, so you can do that other work while waiting for the document to finish processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:startProcessing()
-- ... do other work here while the document is processed in the background ...
Document:waitUntilProcessed()
Waits until the document has finished processing before returning. Returns immediately if the document has already been processed.
This function is useful if you have called Document:startProcessing(), then performed some work, and then want to ensure that the document has finished processing before continuing with some other work.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt', { process = false } )
document:startProcessing()

-- do work --

document:waitUntilProcessed()

-- do more work --
Document:name()
Returns the filename of the document.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

print( document:name() )
Output
file.txt
Document:tostring()
Returns the same value as Document:name(). This is what print() uses when it is passed a Document.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- print will call tostring() on document
print( document )
Document:lines( includeNewlines )
Returns an iterator that iterates over all lines in the document. Each element of the iteration will be a Text object.
The includeNewlines parameter is optional and defaults to false if not specified. If includeNewlines is true then the newline character \n is included at the end of each line.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

for line in document:lines() do
    print( line )
end
Document:allWords()
Returns a WordList object containing all the unique words in the document.
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

local wordlist = document:allWords()
for word in wordlist:words() do
    print( word )
end
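Because the word list functions raise an error on an unprocessed document, a script that receives documents from elsewhere may want to guard the call. The following is a minimal sketch only: allWordsWhenReady and fakeDocument are hypothetical names that are not part of the cta API, and it assumes that calling Document:process() is also safe when processing was started earlier with Document:startProcessing(). The stand-in object exists only so that the sketch can run outside Chinese Text Analyser.

```lua
-- allWordsWhenReady is a hypothetical helper, not part of the cta API.
-- It makes sure the document is processed before asking for its words.
local function allWordsWhenReady( document )
    if not document:hasFinishedProcessing() then
        -- process() blocks until processing has finished
        document:process()
    end
    return document:allWords()
end

-- Stand-in object that mimics the documented behaviour so the sketch can
-- run outside Chinese Text Analyser. In a real script, pass the result of
-- cta.Document( 'file.txt', { process = false } ) instead.
local fakeDocument = {
    processed = false,
    hasFinishedProcessing = function( self ) return self.processed end,
    process = function( self ) self.processed = true end,
    allWords = function( self )
        if not self.processed then
            error( 'document has not been processed' )
        end
        return { '第一', '第二' }
    end,
}

local words = allWordsWhenReady( fakeDocument )
print( #words )
```

The same guard should work for the other functions that require a processed document, such as Document:knownWords() and Document:allStatistics().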
Document:knownWords( wordList )
Returns a WordList object containing all the unique words that exist in both the document and the wordList parameter. The wordList parameter is optional and defaults to cta.knownWords() if not specified.
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- known words based on cta.knownWords()
local known = document:knownWords()
-- ...

-- known words based on a custom word list
local customList = cta.WordList( 'words.txt' )
known = document:knownWords( customList )
-- ...
Document:unknownWords( wordList )
Returns a WordList object containing all the unique words that exist in the document and that don’t exist in the wordList parameter.
The wordList parameter is optional and defaults to cta.knownWords() if not specified.
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- unknown words based on cta.knownWords()
local unknown = document:unknownWords()
-- ...

-- unknown words based on a custom word list
local customList = cta.WordList( 'words.txt' )
unknown = document:unknownWords( customList )
-- ...
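The relationship between Document:knownWords() and Document:unknownWords() can be pictured as a set intersection and a set difference. The sketch below illustrates that idea with plain Lua tables only. It does not use the cta API, the sample words are made up, and it is not a description of how Chinese Text Analyser implements these functions.

```lua
-- Plain Lua tables standing in for a document's unique words and a word list.
local documentWords = { '我', '喜欢', '资料', '搜索' }
local wordList = { '我', '喜欢', '你' }

-- Build a set from the word list for fast membership tests.
local wordSet = {}
for _, word in ipairs( wordList ) do
    wordSet[word] = true
end

-- known   = words in both the document and the word list (intersection)
-- unknown = words in the document but not in the word list (difference)
local known, unknown = {}, {}
for _, word in ipairs( documentWords ) do
    if wordSet[word] then
        known[#known + 1] = word
    else
        unknown[#unknown + 1] = word
    end
end

print( 'known: ' .. table.concat( known, ' ' ) )
print( 'unknown: ' .. table.concat( unknown, ' ' ) )
```

Every word in the document ends up in exactly one of the two lists, and words that appear only in the word list (here '你') appear in neither.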
Document:allStatistics( options )
Returns a table containing frequency and other statistics for each unique word in the document.
Each element of the returned table is a table with the following fields:
- word - the word (note: only present if options.keyByWord is false).
- frequency - the number of times the word appeared in the document.
- percentageFrequency - the number of times the word appeared in the document as a percentage of the total words in the document.
- cumulativePercentageFrequency - the cumulative percentage frequency of the word.
- firstOccurrence - the first occurrence of the word in the document specified as a byte offset from the beginning of the file.
- hskLevel - the lowest HSK level that this word appears in. If the word does not appear in any HSK level, this value will be set to 999.
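As a worked illustration of how the frequency fields relate to each other, the sketch below computes percentageFrequency and cumulativePercentageFrequency for a tiny made-up frequency table using plain Lua. It assumes the cumulative value is a running total taken in frequency-descending order. That ordering is an assumption made for illustration, not something the text above states.

```lua
-- Hypothetical word frequencies, already sorted by frequency descending.
local rows = {
    { word = 'wordA', frequency = 5 },
    { word = 'wordB', frequency = 3 },
    { word = 'wordC', frequency = 2 },
}

-- Total number of words in the sample (sum of all frequencies).
local total = 0
for _, row in ipairs( rows ) do
    total = total + row.frequency
end

-- percentageFrequency is the word's share of the total;
-- cumulativePercentageFrequency is the running sum of those shares.
local running = 0
for _, row in ipairs( rows ) do
    row.percentageFrequency = row.frequency * 100 / total
    running = running + row.percentageFrequency
    row.cumulativePercentageFrequency = running
end

for _, row in ipairs( rows ) do
    print( row.word, row.frequency,
           row.percentageFrequency,
           row.cumulativePercentageFrequency )
end
```

Read this way, the cumulative value answers questions such as "what share of the text is covered by the most frequent words up to and including this one".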
This function takes an optional table containing configuration parameters, which can have the following keys:
options.keyByWord - (boolean)
If keyByWord is false, the returned table is an array, sorted by the other parameters specified in options.
If keyByWord is true, the first return value is a table keyed by the word. A second table is also returned containing a sorted array of words that can be used to process the first table in sorted order. If sorted is false (see below), no second table is returned.
Defaults to false.
It is useful to use keyByWord when you want to be able to easily access statistics for a specific word, e.g.
local stats = document:allStatistics( { keyByWord = true } )
local wordStats = stats['资料']
options.sortBy - (string)
The field to sort by. Valid values are:
- frequency
- firstOccurrence
- word
- hskLevel
Defaults to frequency.
options.sorted - (boolean)
If true, the returned table(s) will be sorted.
If false, the returned table will not be sorted in any particular order. This is useful when you want to keyByWord and you are not interested in having sorted results.
Defaults to true.
Setting this value to false is only useful if keyByWord is true, in which case the second table containing the sort order will not be returned and the values in the first table will not be in any particular order.
It is marginally faster (and more memory efficient) to set sorted to false if you are keying by word and aren’t interested in any particular sort order.
options.ascending - (boolean)
Whether to sort in ascending or descending order.
The default value depends on the sortBy field as follows:
- frequency - false
- firstOccurrence - true
- word - true
- hskLevel - true
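The way the sortBy and ascending defaults interact can be sketched in plain Lua as follows. resolveSortOptions and DEFAULT_ASCENDING are hypothetical names used only for illustration. They are not part of the cta API and say nothing about how Chinese Text Analyser resolves the options internally.

```lua
-- Default sort direction for each sortBy value, as documented.
local DEFAULT_ASCENDING = {
    frequency = false,
    firstOccurrence = true,
    word = true,
    hskLevel = true,
}

-- resolveSortOptions is a hypothetical helper, not part of the cta API.
-- It returns the effective sortBy and ascending values for an options table.
local function resolveSortOptions( options )
    options = options or {}
    local sortBy = options.sortBy or 'frequency'
    local ascending = options.ascending
    if ascending == nil then
        -- only fall back to the default when the caller did not specify a value
        ascending = DEFAULT_ASCENDING[sortBy]
    end
    return sortBy, ascending
end

print( resolveSortOptions() )
print( resolveSortOptions( { sortBy = 'hskLevel' } ) )
print( resolveSortOptions( { sortBy = 'hskLevel', ascending = false } ) )
```

Note the explicit comparison with nil: writing `options.ascending or default` would wrongly discard an explicit false.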
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'

local function printRow( key, stats )
    cta.write( key )
    if stats.word ~= nil then
        cta.write( '', stats.word )
    end
    cta.print( '', stats.hskLevel,
               stats.firstOccurrence,
               stats.frequency,
               stats.percentageFrequency,
               stats.cumulativePercentageFrequency )
end

local function printKeyedByRow( statistics )
    for i, stats in ipairs( statistics ) do
        printRow( i, stats )
    end
end

local function printKeyedByWord( statistics, sorted )
    -- the order of associative arrays is not guaranteed, and so
    -- if we have a valid 'sorted' parameter, use it to iterate
    -- over the statistics in the correct order
    if sorted ~= nil then
        for _, word in ipairs( sorted ) do
            printRow( word, statistics[word] )
        end
    else
        -- print unsorted
        for word, values in pairs( statistics ) do
            printRow( word, values )
        end
    end
end

local document = cta.Document( 'file.txt' )

-- Using all defaults, statistics is sorted by
-- frequency descending
local statistics = document:allStatistics()
printKeyedByRow( statistics )

-- Sort statistics by hskLevel descending
statistics = document:allStatistics( { sortBy = 'hskLevel', ascending = false } )
printKeyedByRow( statistics )

-- Sort by frequency descending (the default), key by word.
local sortOrder
statistics, sortOrder = document:allStatistics( { keyByWord = true } )
printKeyedByWord( statistics, sortOrder )

-- Key by word, don't care about sort order.
statistics = document:allStatistics( { keyByWord = true, sorted = false } )
printKeyedByWord( statistics, nil )
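Because words outside every HSK level are reported with an hskLevel of 999, a script can pick them out with a simple comparison. The sketch below runs on a made-up array shaped like the result of Document:allStatistics() with keyByWord set to false. The rows, the words and the NOT_IN_HSK name are all hypothetical, and the code uses plain Lua rather than the cta API.

```lua
-- Hypothetical rows shaped like the array returned by allStatistics().
local statistics = {
    { word = 'wordA', frequency = 5, hskLevel = 1 },
    { word = 'wordB', frequency = 2, hskLevel = 999 },
    { word = 'wordC', frequency = 1, hskLevel = 999 },
    { word = 'wordD', frequency = 3, hskLevel = 4 },
}

-- The documented marker value for words that appear in no HSK level.
local NOT_IN_HSK = 999

-- Collect the words outside the HSK lists and add up how often they occur.
local outside, outsideFrequency = {}, 0
for _, stats in ipairs( statistics ) do
    if stats.hskLevel == NOT_IN_HSK then
        outside[#outside + 1] = stats.word
        outsideFrequency = outsideFrequency + stats.frequency
    end
end

print( table.concat( outside, ', ' ), outsideFrequency )
```

Comparing against the marker value explicitly avoids treating 999 as if it were a very high HSK level when sorting or averaging levels.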
Document:knownStatistics( options )
Returns a table containing frequency and other statistics for each unique known word in the document.
The options parameter is the same as for Document:allStatistics() except that it can also take one additional field:
options.wordList - (WordList)
Only words that exist in this WordList will be treated as known. If not specified this value defaults to cta.knownWords().
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )
local wordList = cta.WordList( 'words.txt' )

-- get 'known' word statistics based on cta.knownWords()
local statistics = document:knownStatistics()
-- ...

-- get 'known' word statistics based on wordList
statistics = document:knownStatistics( { wordList = wordList } )
-- ...
Document:unknownStatistics( options )
Returns a table containing frequency and other statistics for each unique unknown word in the document.
The options parameter is the same as for Document:allStatistics() except that it can also take one additional field:
options.wordList - (WordList)
Only words that do not exist in this WordList will be treated as unknown. If not specified this value defaults to cta.knownWords().
An error will occur if this function is called before the document has finished processing.
Example
local cta = require 'cta'
local document = cta.Document( 'file.txt' )
local wordList = cta.WordList( 'words.txt' )

-- get 'unknown' word statistics for all words *not* in cta.knownWords()
local statistics = document:unknownStatistics()
-- ...

-- get 'unknown' word statistics for all words *not* in wordList
statistics = document:unknownStatistics( { wordList = wordList } )
-- ...
Document:findWord( word )
Returns an iterator that finds all lines and sentences in the document containing the word (or words) specified by the word parameter.
The word parameter can be one of:
- string - a single word, e.g. '搜索'
- table - an array of words, e.g. { '第一', '第二', '第三' }
- WordList - a WordList, e.g. as returned from Document:unknownWords() or cta.WordList()
Each iteration will return three values:
- the word that was found
- the sentence containing the word
- the line containing the word
Example
local function find( document, words )
    for word, sentence, line in document:findWord( words ) do
        print( word )
        print( '', sentence )
        print( '', line )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all instances of a single word
find( document, '搜索' )

-- find all instances of multiple words
find( document, { '第一', '第二', '第三' } )

-- find all instances of unknown words
find( document, document:unknownWords() )
Document:findLinesContaining( word )
Similar to Document:findWord() except each iteration only returns the word and the line.
Example
local function findLines( document, words )
    for word, line in document:findLinesContaining( words ) do
        print( word )
        print( '', line )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all lines containing a single word
findLines( document, '搜索' )

-- find all lines containing any one of multiple words
findLines( document, { '第一', '第二', '第三' } )

-- find all lines containing unknown words
findLines( document, document:unknownWords() )
Document:findSentencesContaining( word )
Similar to Document:findWord() except each iteration only returns the word and the sentence.
Example
local function findSentences( document, words )
    for word, sentence in document:findSentencesContaining( words ) do
        print( word )
        print( '', sentence )
    end
end

local cta = require 'cta'
local document = cta.Document( 'file.txt' )

-- find all sentences containing a single word
findSentences( document, '搜索' )

-- find all sentences containing any one of multiple words
findSentences( document, { '第一', '第二', '第三' } )

-- find all sentences containing unknown words
findSentences( document, document:unknownWords() )
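When searching for several words at once, it can be convenient to collect the results per word rather than print them as they arrive. The following is a minimal sketch of that idea. groupSentencesByWord and fakeDocument are hypothetical names that are not part of the cta API, and the sketch assumes that the word and sentence values can be converted with tostring(), as the print() calls in the examples above suggest. The stand-in object exists only so that the code can run outside Chinese Text Analyser.

```lua
-- groupSentencesByWord is a hypothetical helper, not part of the cta API.
-- It returns a table mapping each found word to a list of its sentences.
local function groupSentencesByWord( document, words )
    local grouped = {}
    for word, sentence in document:findSentencesContaining( words ) do
        local key = tostring( word )
        grouped[key] = grouped[key] or {}
        table.insert( grouped[key], tostring( sentence ) )
    end
    return grouped
end

-- Stand-in object with made-up results so the sketch can run outside
-- Chinese Text Analyser. In a real script, pass a cta.Document instead.
local fakeDocument = {
    findSentencesContaining = function( self, words )
        local results = {
            { '第一', '第一句。' },
            { '第二', '第二句。' },
            { '第一', '又是第一。' },
        }
        local i = 0
        -- iterator: returns the next word/sentence pair, or nothing when done
        return function()
            i = i + 1
            if results[i] then
                return results[i][1], results[i][2]
            end
        end
    end,
}

local grouped = groupSentencesByWord( fakeDocument, { '第一', '第二' } )
for word, sentences in pairs( grouped ) do
    print( word, #sentences )
end
```

The same pattern should apply to Document:findLinesContaining() by swapping the function name and storing lines instead of sentences.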