Loading Information into Vision

Reading Files

The asFileContents message can be sent to a String representing the name of a file and returns the contents of the file as a String. For example:

  "myFile" asFileContents

opens the file myFile and returns its contents as a string. Since this program is not terminated with a ';' character, the contents of the file will print by default. You could save the contents of the file into a variable. For example:

  !fileContents <- "myFile" asFileContents ;
  fileContents count printNL ;

This program stores the contents of the file myFile into the variable fileContents. The second statement in this program displays the number of characters in the file.

There are two primary reasons for reading a file into Vision:

The file contains Vision program code which you want to read and execute.
The file contains information which needs to be parsed and stored in properties of other Vision objects.

Loading Programs

It is often useful to save methods and other Vision programs in files outside the Vision database. You can interactively load these files by reading them into a Vision editor and executing the request; however, it is also useful to be able to read in files of code as part of a program. You can use the asFileContents evaluate expression to do this. For example, if the file sampleCode contains the Vision code:

  !x <- 3 ;
  !y <- 10 ;

and you execute:

  "sampleCode" asFileContents evaluate;
  x print ; y print ;

you will see the values of x and y displayed. The first statement in this program will read in the file and evaluate its contents as though you had entered the code directly. After the file contents have been evaluated, you have access to the variables x and y defined in this file.

By default, the code in your file will be evaluated as though you typed it into your top-level workspace. You can also execute code in the context of a particular object. For example, if the file sampleCode contains the Vision code:

  "This object is an instance of class " print;
  whatAmI printNL ;

and you execute:

  "sampleCode" asFileContents evaluateIn: 3 ;

you should see:

  This object is an instance of class Integer

Loading Data Files

The string returned by the asFileContents message can be viewed as a data file. Each line of the data file can be thought of as a record. Each record consists of one or more fields. The fields in the record can be in a fixed format or delimited by a specific character. In a fixed format, each field occupies the same character positions from record to record. In a delimited format, fields are separated by a known character so the position of a specific field relative to its record will vary from record to record.

For example, in a fixed field format, you could have a security id in positions 1 - 6 , a name in positions 7 - 36, and a price in positions 37 - 46 as illustrated below:

  IBM   International Business Mach      123.500
  GM    General Motors                    33.125
  XON   Exxon Corp                       111.000

The same data in a comma delimited format would look like:

  IBM, International Business Mach, 123.500
  GM, General Motors, 33.125
  XON, Exxon Corp, 111.000

In the fixed field format, each field starts at a known character position in the record and has a fixed size. Field values that require fewer characters are blank-padded to fit into the fixed field size. In the delimited format, the blanks between the fields are optional (unless the blank is the delimiter). The selected delimiter should not be a character that can be used within a field. The "tab" (ctrl I) character is often used as the delimiter for this reason.

The asLines message is used to divide the string returned by the asFileContents message into a list of strings, using the carriage-return character as a delimiter. For example:

  #-- create string containing the file name, including its path.
  !fileName <- "/users/xyz/dir1/file.txt" ;

  #--  extract contents of file as a single, large string.
  !theFile <- fileName asFileContents ;

  #--  convert the file into a list of records
  #--  each element of records is a single string
  !records <- theFile asLines ;
  records count printNL ;

The variable records returns a list of string objects in which each element corresponds to a record in the original file. Each record can then be divided into its fields based on position or delimiter character. Each field is a string by default; a field value can be converted to another data type, such as a number.

The following Vision code can be used to break fixed format records into fields:

  #--  break each record into its fields - fixed format approach
  !xrecords <- records
    extendBy: [ !securityId <- ^self take: 6 . stripBoundingBlanks ;
                !name <- ^self from: 7 for: 30 . stripBoundingBlanks ;
                !price <- ^self from: 37 to: 47 . asNumber ;
              ] ;

The variable xrecords returns the list of records extended by three additional variables representing the fields in the record. The securityId variable is defined as the first 6 characters of the record. The name variable is defined as a 30 character string starting at position 7 in the record. Blanks at the start and end of these fields are removed. The price variable is defined as the field from position 37 to 47 in the record and is converted to a number.

The following Vision code can be used to break comma-delimited records into fields:

  #--  break each record into its fields - comma-delimited approach
  !xrecords <- records
     extendBy: [ !fields <- ^self breakOn: "," ;     # break on delimiter
                 !securityId <- fields at: 1 . stripBoundingBlanks ;
                 !name <- fields at: 2 . stripBoundingBlanks ;
                 !price <- fields at: 3 . asNumber ;
               ] ;

The breakOn: message is used to break the record into fields based on one or more delimiter characters and returns a list of strings. The securityId, name, and price variables are defined as the first, second, and third elements of this list.
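
For readers who want to compare the two approaches outside Vision, the following is a minimal Python sketch of the same field extraction. It is an analogue, not Vision code: the function names parse_fixed and parse_delimited are hypothetical, and the field widths (6, 30, and 10 characters) are taken from the fixed-format layout described earlier.

```python
def parse_fixed(record):
    """Split a fixed-format record into (security_id, name, price)."""
    security_id = record[0:6].strip()   # positions 1-6
    name = record[6:36].strip()         # positions 7-36
    price = float(record[36:46])        # positions 37-46
    return security_id, name, price


def parse_delimited(record, delimiter=","):
    """Split a delimited record into (security_id, name, price)."""
    fields = [field.strip() for field in record.split(delimiter)]
    return fields[0], fields[1], float(fields[2])


# Build the fixed-format record from explicit widths to avoid miscounting blanks.
fixed_record = f"{'IBM':<6}{'International Business Mach':<30}{'123.500':>10}"
delimited_record = "IBM, International Business Mach, 123.500"

print(parse_fixed(fixed_record))
print(parse_delimited(delimited_record))
```

Both calls produce the same three values, which illustrates that the two formats carry identical information and differ only in how field boundaries are located.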

The information in these records can be used to update properties of other objects. Field values are often used as a "key" for another structure. For example, the securityId can be used to reference a specific instance of a Security class. Assume the class Security had defined the properties name and price.

  #-- use each record to update name and price properties of Security
  xrecords
  do: [ !security <- ^global Named Security at: securityId ;
        security isntNA
          ifTrue: [ security :name <- name ;
                    security :price <- price ;
                  ]
          ifFalse: [ securityId print ; " Not Found." printNL ; ] ;
      ] ;

Additional String Parsing Techniques

There are many additional messages defined at String which are useful for parsing data files.

The asRecords message combines the asFileContents and asLines steps. The expression:

  "sampleCode" asRecords

is the same as:

  "sampleCode" asFileContents asLines

The asCSVRecords message can be used to convert a comma-separated-value format file to a list of strings extended by the message fields. Embedded commas in the original file are preserved. For example, if the file sample.csv contains:

  key1,description1,value1
  key2,"description 2 with , character",value2

then:

  "sample.csv" asCSVRecords
  do: [ "Field 1: " print ; fields at: 1 . printNL ;
        "Field 2: " print ; fields at: 2 . printNL ;
        "Field 3: " print ; fields at: 3 . printNL ;
      ] ;

displays:

  Field 1: key1
  Field 2: description1
  Field 3: value1
  Field 1: key2
  Field 2: description 2 with , character
  Field 3: value2
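
The handling of quoted fields with embedded commas can be illustrated with a short Python sketch using the standard csv module. This is an analogue of the behavior described above, not the Vision implementation; the helper name csv_records is hypothetical, and the sample text is the sample.csv content shown earlier.

```python
import csv
import io

# The same two lines as the sample.csv file above.
sample = 'key1,description1,value1\nkey2,"description 2 with , character",value2\n'


def csv_records(text):
    """Return each line of CSV text as a list of field strings."""
    return list(csv.reader(io.StringIO(text)))


for fields in csv_records(sample):
    for number, field in enumerate(fields, start=1):
        print(f"Field {number}: {field}")
```

A plain split on "," would break the second record into four pieces; a CSV-aware reader keeps the quoted description as one field.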

Several messages are available for filtering characters from the input file. The message stripChar: returns the recipient string with every occurrence of the supplied character removed. The message translate:to: replaces all occurrences of the character in the first parameter with the string supplied in the second parameter. For example:

  "sampleCode" asFileContents
     stripChar: "+" . translate: "*" to: " " .

removes all the '+' characters from the string and changes all occurrences of the '*' character to a blank.

In some earlier examples, you have seen several techniques used for parsing the data fields and converting the values to different types. In the expression:

  !securityId <- ^self take: 6 ;

the message take: is used to extract the first 6 characters of the string. This message can be used in conjunction with the message drop: which eliminates characters from the start or end of the string. For example, if the name field starts in position 7 for 30 characters, the field could be extracted using:

  !name <- ^self drop: 6 . take: 30 ;

The messages from:for: and from:to: provide alternative ways to do the same thing. For example:

  !alt1 <- ^self from: 7 for: 30 ;
  !alt2 <- ^self from: 7 to: 36 ;

The messages from: and to: are variations on the from:to: message. When the supplied parameter is a number, from: returns the string starting at this position to the end of the string and to: returns the string from position 1 to the supplied position. The supplied parameter can also be a single character. In this case, from: returns the string starting with the first occurrence of this character and to: returns the string ending with the first occurrence of this character.

The message stripBoundingBlanks is used to eliminate blank characters from either end of a string. For example:

  !alt3 <- ^self from: 7 for: 30 . stripBoundingBlanks ;

The expression:

  !price <- ^self from: 37 to: 47 . asNumber ;

uses the message asNumber to convert the string value to a numeric value. The message convertToNumber converts a string containing commas to a numeric value. Both of these messages return NA if the string cannot be converted to a number.

Several messages are available for converting integers to dates. The message asDate converts an integer in the form CCYYMMDD, YYMMDD, YYMM, or YY into a date where CC is the century, YY is the year, MM is the month, and DD is the day. For example:

  !priceDate <- ^self from: 48 for: 10 . asNumber asInteger asDate ;

If the field starting at position 48 for 10 characters contained " 960515 ", the variable priceDate would be set to the date May 15, 1996. Several other variations of the asDate message have been defined including: asDateFromMMDD, asDateFromMMDDYY, asDateFromMMDDYYYY, asDateFromMMYY, asDateFromYYMMDD, and asDateFromYYYYMM.

The message else: is useful for assigning default values when a field does not convert as expected. For example:

  !price <- ^self from: 37 to: 47 . asNumber else: 0.0 ;

returns the value 0.0 if the field cannot be converted to a numeric value (i.e., the asNumber message returns an NA) and

  !priceDate <- ^self from: 48 for: 10 .
     asNumber asInteger asDate else: ^today ;

returns the value ^today if the field cannot be converted to a date (i.e., the asDate message returns an NA).
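
The convert-or-default pattern can be sketched in Python as follows. This is an analogue only: None stands in for Vision's NA, and the names as_number, convert_to_number, and or_else are hypothetical stand-ins for asNumber, convertToNumber, and else: respectively.

```python
def as_number(text):
    """Return the numeric value of text, or None when it cannot be converted."""
    try:
        return float(text)
    except ValueError:
        return None


def convert_to_number(text):
    """Like as_number, but remove commas before converting."""
    return as_number(text.replace(",", ""))


def or_else(value, default):
    """Return default when the conversion failed, otherwise the value."""
    return default if value is None else value


print(or_else(as_number("   123.500"), 0.0))   # converts normally
print(or_else(as_number("n/a"), 0.0))          # falls back to the default
print(convert_to_number("1,234.50"))           # commas are ignored
```

Applying the default at the point of conversion keeps later code free of checks for missing values.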

Memory Resources and Advanced Data Loading Techniques

Private Memory and String Structures

The amount of private memory allocated at any point in your Vision session for temporary and permanent structures can be displayed using the following query:

  AdminTools totalNetworkAllocation

For example, if you executed the above as the first query after starting a new session you would get a baseline (per process) memory allocation, in bytes, for your Vision database. You can monitor private memory usage after loading a data file or at any point in your sequence of queries. If the amount of memory allocated exceeds what you would expect, it might be useful to review the data load code and optimize it if possible. This is particularly useful if the number of files you expect to load exceeds 3 or 4 or if you load the same file many times in your Vision session without exiting.

Temporary structures remain in memory for the duration of the Vision session if they are referenced by global variables. In the previous sections, the variable fileContents or theFile is used to save the contents of a file as a string. The subsequent conversion of this string into lines (also strings), stored as the variable records, and the parsing of each record into individual field variables, accessed by the variable xrecords and its extended properties, increases the private memory allocation as new strings are created.

If the strings and variables referencing them were created in a method block, following execution of the block these temporary structures would no longer be resident in memory provided that no references to them existed from permanent structures in the database. As properties are updated or new instances are created using these string field values, permanent references are made to these strings and the structures remain in memory.

The amount of memory used to store these permanently referenced strings depends on how the strings are created. Strings are stored in clusters or stores, as are all classes and their instances. Since the String class is a Built-in class, new strings do not get added to its default cluster. The Vision expression:

  String instanceList count

will always return a value of 1.

The variables created to reference strings actually represent the newly created string clusters. For example:

  !theFile <- "myFile" asFileContents ;
  theFile instanceList count

will return a value of 1 since the entire file is returned as a single string object. The subsequent conversion of the single string into a list of strings creates another string cluster:

  !records <- theFile asLines;

The variable records returns a List object which contains references to the new strings, and the Vision query:

  records count

returns the number of elements in the list. To access the new string cluster, you need to access any element of the list; for example:

  records at: 1 . instanceList count

will return the number of strings in the new cluster.

The size of each string cluster (number of characters) can be obtained using the Vision code:

  #-- single string
  theFile count printNL;

  #-- list of strings
  records at: 1 . instanceList total: [ count ] . printNL;

The final parsing of each string into its respective field items using a fixed field input file:

  !xrecords <- records
    extendBy: [ !securityId <- ^self take: 6 . stripBoundingBlanks ;
                !name <- ^self from: 7 for: 30 . stripBoundingBlanks ;
                !price <- ^self from: 37 to: 47 . asNumber ;
              ] ;

creates a new string cluster for each property that is not further converted into another type. Each string cluster contains the same number of strings as there are records in the input file.

For a comma-delimited input file, the final parsing of each string into its field items using the form:

  !xrecords <- records
     extendBy: [ !fields <- ^self breakOn: "," ;     # break on delimiter
                 !securityId <- fields at: 1 . stripBoundingBlanks ;
                 !name <- fields at: 2 . stripBoundingBlanks ;
                 !price <- fields at: 3 . asNumber ;
               ] ;

could potentially create references to string clusters containing more than just the strings associated with a given field. The breakOn: message breaks the string record into fields based on one or more delimiter characters and returns a list of strings. The variable fields references a new string cluster which contains all strings for all records. The securityId and name variables would reference the same string cluster as fields if there were no blanks that needed to be stripped since the original string would have been returned by the method. A new string cluster would have been created if blanks were indeed stripped, containing just the strings for the single field item.

Subsequent use of these strings to update or create permanent structures:

  xrecords
  do: [ !security <- ^global Named Security at: securityId ;
         security isNA
           ifTrue: [ :security <- ^global Security createInstance: securityId ] ;
         security name != name
           ifTrue: [ security :name <- name ] ;
         security :price <- price ;
       ] ;

results in permanent references to string clusters that contain, at a minimum, the string values for a given field or, in the worst case, all strings in the original input file.

A simple technique to disassociate the strings representing field values from the cluster referenced by the field item is to create a copy of each string at the point where it is to be used to update a property in a permanent object. The above code would need to be revised as follows:

  xrecords
  do: [ !security <- ^global Named Security at: securityId ;
         security isNA
           ifTrue: [ :security <- ^global Security
                       createInstance: securityId copyString #<--- update code
                   ] ;
         security name != name
           ifTrue: [ security :name <- name copyString ] ;   #<--- update name
         security :price <- price ;
       ] ;

The resultant string cluster representing a property value would only include those strings which were actually used to update the property. For example, if a data file contained 2,000 records with 2 string fields and if only 20 new securities were created while 40 existing securities had their name refreshed, the new string cluster created for the code property value would have only 20 strings and the new string cluster for the name property value would have 60 strings. Without creating a copy of the strings as needed, both the code and name string clusters would minimally contain 2,000 strings using the fixed field format parser. For the comma-delimited format parser, the two properties could potentially reference a single cluster containing 4,000 strings.
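
The counts in this example can be reproduced with a small Python calculation. It is only an arithmetic sketch of the reasoning above; the function name retained_strings is hypothetical, and it assumes, as in the code shown, that each newly created security also has its name set.

```python
def retained_strings(records, new_securities, refreshed_names, use_copy):
    """Return (code_cluster_size, name_cluster_size) for the fixed-format parser."""
    if use_copy:
        # Only strings actually stored in a property are retained.
        return new_securities, new_securities + refreshed_names
    # Without a copy, each property references the whole per-field cluster.
    return records, records


with_copy = retained_strings(2000, 20, 40, use_copy=True)
without_copy = retained_strings(2000, 20, 40, use_copy=False)

# Comma-delimited parser, worst case: both fields share one cluster.
shared_cluster = 2000 * 2

print(with_copy, without_copy, shared_cluster)
```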

For installations that may not yet have the method copyString available, the following code needs to be executed at the start of an interactive session or committed permanently in the database:

  String
  defineMethod: [ | copyString |
    ^self asSelf drop: 0
  ] ;

Messages that manipulate string content, such as take:, drop:, and concat:, all return a modified copy of the recipient string. The recipient string itself is not changed; only a new copy of the string is returned.

Another technique to further decrease the amount of memory needed to load data would be to omit creating references to intermediate string clusters:

  !xrecords <- "myFile" asFileContents asLines
    extendBy: [ . . .
                . . .
              ] ;

In the above revision, the intermediate reference to the single string representing the data file does not remain and only the string cluster created by asLines and referenced by xrecords remains in memory as long as the Vision session is active.

As mentioned earlier, globally referenced string clusters remain in private memory. To optimize private memory usage when loading data, methods should be defined to process the data load so that following execution any unreferenced temporary variables or string clusters are no longer resident in memory.

Batch Loads

The message asFileContents is suitable for loading data from files that are relatively small, on the order of 5-30 MB, consisting of a single file or at most several files. As the size of the input file increases or the number of files needed to complete the data load increases, memory resources (usually swap) become a limiting factor and the load may not always succeed.

For single files greater than 30 MB, memory resources can be optimized if the file is read and processed in batches rather than in its entirety. The technique described below requires a fixed field format and uses protocol of the Open Vision ToolKit.

To access a file with read-only permission, use the message asOpenVisionChannel and supply file as the type and the file name as the resource in its specification string. Choose the appropriate trim format option, which controls the removal of leading and trailing blanks from strings.

  #-- create a string containing the file name
  !fileName <- "/users/xyz/dir1/file.txt" ;

  #-- open a channel to the file and set a trim format
  !file <- "file:" concat: fileName .
       asOpenVisionChannel
       setTrimFormatToTrailingBlanks ;

To process the file in batches, you would supply a batch size (number of records to process at a time) and calculate the file size, the record size, the number of records, and the number of batches to process. The following messages are sent to the channel object that is stored in the variable file.

The message byteCount returns the size of the file, the message getLine returns a string obtained by reading the next line from the channel, the message getString: size at: anOffset returns a string containing the number of characters specified by the size argument beginning at the offset position in the file, and the message close closes the file.

  #-- get the file size
  !fileSize <- file byteCount ;

  #-- calculate the record size and number of records
  !recSize <- file getLine count ;
  !records <- (fileSize / recSize) asInteger ;

  #-- specify a batch (no. of records)
  !batchCount <- 1000 ;

  #-- calculate the batch size and the number of batches
  !batchSize <- batchCount * recSize ;
  !batches <- (records / batchCount + 1) asInteger ;
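
The batch arithmetic can be checked with a short Python sketch. It is an analogue, not Vision code: the function name batch_plan is hypothetical, asInteger is assumed to truncate, and rec_size is assumed to be the full length of one record including its line terminator.

```python
def batch_plan(file_size, rec_size, batch_count):
    """Return (records, batch_size, batches) for a fixed-format file."""
    records = file_size // rec_size        # number of records in the file
    batch_size = batch_count * rec_size    # characters read per batch
    batches = records // batch_count + 1   # the extra batch covers a partial tail
    return records, batch_size, batches


# 2,500 records of 47 characters each, read 1,000 records at a time.
print(batch_plan(2500 * 47, 47, 1000))

# An exact multiple still yields one extra batch, which will be empty.
print(batch_plan(2000 * 47, 47, 1000))
```

The second case shows why a load loop needs to tolerate an empty final batch: adding 1 unconditionally produces an extra batch whenever the record count is an exact multiple of the batch count.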

The Vision code below illustrates the iterative process of reading in a batch of records, specified above, as a single string and parsing it using techniques described earlier in this document:

  #-- batch numbers start with 0
  !firstBatch <- 0;
  !lastBatch <- (batches - 1) asInteger;

  #-- iterate over the batches
  batches sequence0
  iterate:
    [ !batchNum <- ^self;

      #-- get offset for current batch
      !offset <- (batchNum * ^my batchSize + 1) asInteger;

      #-- get size of current batch
      !bytesLeft <- ^my fileSize - offset + 1;
      !size <- ^my batchSize min: bytesLeft . max: 0 . asInteger ;

      #-- load current batch
      !string <- ^my file getString: size at: offset ;

      string count > 0 ifTrue:
       [ #-- parse records in current batch
         !xrecords <- string asLines
           extendBy:
             [ !id <- ^self take: 6 . stripBoundingBlanks ;
               !name <- ^self from: 7 for: 30 . stripBoundingBlanks;
                 . . .
             ];
         #-- process records in current batch
         xrecords do:
           [ !security <- ^global Named Security at: id ;
             security isntNA ifTrue:
               [ security :name <- name copyString;
                  . . .
               ];
           ];
       ];
    ];  ## end iterate

  #-- close the file
  file close ;

The batch sequence count begins with 0 in order to get the correct offset into the file for the first iteration. The method sequence0, when sent to an integer, returns a list of integers beginning with 0 and ending with 1 less than the recipient integer. Each iteration returns a string of the expected size (batchSize), and the string is converted into a list of records using the asLines method. The batch size is bounded by the number of characters left to process, since the last batch is not guaranteed to have the full number of records specified by batchCount. A final check that the string size is greater than 0 is added in case an empty string is returned.
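
The same loop structure can be sketched in Python for comparison. This is an analogue under stated assumptions, not the Open Vision ToolKit: the function name read_batches is hypothetical, file offsets are 0-based here (so there is no "+ 1" as in the Vision example), and the demonstration file is a small temporary file of 37-character records created only for this sketch.

```python
import os
import tempfile


def read_batches(path, rec_size, batch_count):
    """Yield lists of fixed-size records, batch_count records at a time."""
    file_size = os.path.getsize(path)
    batch_size = batch_count * rec_size
    batches = (file_size // rec_size) // batch_count + 1
    with open(path, "rb") as handle:
        for batch_num in range(batches):
            offset = batch_num * batch_size                  # 0-based offset
            size = max(min(batch_size, file_size - offset), 0)
            handle.seek(offset)
            chunk = handle.read(size).decode("ascii")
            if chunk:                                        # skip an empty final batch
                yield chunk.splitlines()


# Demonstration: three records of 36 characters plus a newline each.
rows = [("IBM", "International Business Mach"), ("GM", "General Motors"), ("XON", "Exxon Corp")]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="\n") as tmp:
    tmp.write("".join(f"{sid:<6}{name:<30}\n" for sid, name in rows))
    demo_path = tmp.name

for batch in read_batches(demo_path, rec_size=37, batch_count=2):
    print([line[:6].strip() for line in batch])

os.remove(demo_path)
```

Only one batch of records is held in memory at a time, which is the point of the technique.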

Multiple File Loads

If data to be loaded is provided in separate files rather than in a single file, the batch load technique may not be appropriate. In this case, the standard load based on asFileContents can be further optimized with respect to swap usage as described in the example below.

Assume the class Security has a time series property and a method to load monthly scores from a data file. The data file provides the id, name and score for a universe of securities for a given date in comma delimited format. The method creates new Security instances if they do not exist in order to store the full score history. In this example, data files to be processed range in size from 1 to 2 MB and span dates from 8712 to 9709. The data is loaded using the following Vision code:

  Security loadMonthlyScoresAsOf: 8712 fromFile: "score.8712" ;
  Security loadMonthlyScoresAsOf: 8801 fromFile: "score.8801" ;
   .
   .
   .
  Security loadMonthlyScoresAsOf: 9708 fromFile: "score.9708" ;
  Security loadMonthlyScoresAsOf: 9709 fromFile: "score.9709" ;

or equivalently:

  !start <- 8712 asDate;
  !end <- 9709 asDate;
  !date <- start;

  [ date <= end ]
  whileTrue:
    [ !file <- "score." concat: (date asInteger drop: 2 . take: 4);
      Security loadMonthlyScoresAsOf: date fromFile: file ;
     :date <- date + 1 monthEnds;
    ];

Depending on the available memory resources at the time of execution, the serial load of the files might not complete and will use up a large amount of swap if executed in a single Vision query as shown above.

To optimize swap usage, force query execution at regular intervals either by adding ?g or by pressing the F2 key. Although both load styles above are serial, using the whileTrue: block forces all serial loads to be processed in a single query; consequently, this technique should not be used for processing multiple file loads.

The most optimized multiple file load forces query execution after each file is processed; there is no overhead in processing time:

  Security loadMonthlyScoresAsOf: 8712 fromFile: "score.8712" ;
  ?g
  Security loadMonthlyScoresAsOf: 8801 fromFile: "score.8801" ;
  ?g
   .
   .
  Security loadMonthlyScoresAsOf: 9708 fromFile: "score.9708" ;
  ?g
  Security loadMonthlyScoresAsOf: 9709 fromFile: "score.9709" ;
  ?g
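
Because the whileTrue: form cannot be used here, the long list of load statements has to be written out. One option is to generate that text with a small script outside Vision; the Python sketch below does this. It is an illustration only: the function names monthly_periods and load_script are hypothetical, the periods are assumed to be YYMM values within a single century, and the generated lines simply repeat the statement and ?g pattern shown above.

```python
def monthly_periods(start_yymm, end_yymm):
    """Return YYMM strings for every month from start to end, inclusive."""
    year, month = divmod(start_yymm, 100)
    end_year, end_month = divmod(end_yymm, 100)
    periods = []
    while (year, month) <= (end_year, end_month):
        periods.append(f"{year:02d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return periods


def load_script(start_yymm, end_yymm):
    """Return the load statements, each followed by ?g, as a list of lines."""
    lines = []
    for period in monthly_periods(start_yymm, end_yymm):
        lines.append(f'Security loadMonthlyScoresAsOf: {period} fromFile: "score.{period}" ;')
        lines.append("?g")
    return lines


script = load_script(8712, 9709)
print("\n".join(script[:4]))
```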

A further optimization would be to reverse the load order, i.e. start with the largest file (dated 9707). However, the additional savings in swap are not of the same order of magnitude; therefore, this step can be omitted if the data loads cannot be processed in an arbitrary order.