Identifying and Splitting Records

In order to accurately identify and separate all records in nontabular data, the following needs to be established:

What is the start_line_pattern (regular expression) that signifies the start of a record?
Is there is a end_line_pattern (regular expression) that signifies the end of a record?
Does the file start_in_a_record? true or false.

Is there is a start_line_pattern present for the first record?
Does the file end_in_a_record? true or false.

Is there is an end_line_pattern present for the last record?
Do you need to capture_start_line? true or false

Does the start_line_pattern contain data that you would like the capture?

Below is the starting point for any nontabular YAML mapping, where the above can be defined:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp //
  end_line_pattern: !ruby/regexp //
  start_in_a_record:
  end_in_a_record:
  capture_start_line:

Common non_tabular_row mapping examples:

Example 1:

For data that contains both a start_line_pattern and end_line_pattern for every record, and where the line being used to signify the start_line_pattern doesn’t contain any data that requires capturing, the non_tabular_row mapping would be the following:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\A-{6}\z/
  end_line_pattern: !ruby/regexp /\A\+{6}\z/
  capture_start_line: false
  start_in_a_record: false
  end_in_a_record: false

Here example regular expression (regexp) patterns for start_line_pattern and end_line_pattern have been defined. The start_line_pattern of each record does not need to be captured, so capture_start_line is set to false. The data does not start_in_a_record or end_in_a_record, so both are also set to false.

Given that capture_start_line, start_in_a_record and end_in_record are all set the false, they can be removed from the mapping as this is their default behavior, therefore the non_tabular_row mapping could look like the following:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\A-{6}\z/
  end_line_pattern: !ruby/regexp /\A\+{6}\z/

Example 2:

For data that contains a start_line_pattern for every record but no end_line_pattern, and where the start_line_pattern does not need capturing, the non_tabular_row mapping would be the following:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\A-{6}\z/
  end_in_a_record: true

The start_line_pattern has been defined, there is no end_line_pattern to be defined so that part of the mapping can be removed. The data does not start_in_a_record as there is a start_line_pattern before the first record, as a result capture_start_line is false. Finally, end_in_a_record is true as there is no end_line_pattern to signify the end of the last record.

GOTCHA: If end_in_a_record is left undefined, the last record will not be identified as end_in_a_record defaults to false

Example 3:

For data that contains a line which separates records, but does not appear before the first record or after the last record and where the separating line does not need to be captured, the non_tabular_row mapping would look like the following:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\A={8}\z/
  start_in_a_record: true
  end_in_a_record: true

Here we treat the separating line as the start_line_pattern, and set start_in_a_record to true as there is no start_line_pattern before the first record. Similarly, we set end_in_a_record to true as there is no end_line_pattern after the last record. There is no end_line_pattern to be defined so that part of the mapping can be removed, along with capture_start_line as it is set to false.

GOTCHA: If only a start_line_pattern is defined, with start_in_a_record and end_in_a_record left undefined, neither the first and last record would be captured.

Example 4:

For data that contains a start_line_pattern for every record but no end_line_pattern and where the start line does require capturing because it contains data, the non_tabular_row mapping would be the following:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\AH,\d{8}-/
  capture_start_line: true
  start_in_a_record: false
  end_in_a_record: true

Here the start_line_pattern has been defined, capture_start_line has been set to true, as a result start_in_a_record needs to be set to false. Finally, end_in_a_record is set to true as there is no end_line_pattern after the last record.

GOTCHA: part 1 If capture_start_line is left undefined, data from the start line of each record cannot be captured as capture_start_line is false by default.

GOTCHA: part 2 If end_in_a_record is left undefined, the last record will not be identified as end_in_a_record is false by default.

Notes:

The non_tabular_row mapping can be written verbosely if you find this improves the readability. For instance, example 3 could be written as follows:

---
non_tabular_row:
  start_line_pattern: !ruby/regexp /\A={8}\z/
 # no end_line_pattern present
 # end_line_pattern: !ruby/regexp //
  capture_start_line: false
  start_in_a_record: true
  end_in_a_record: true

Removing Lines:

Lines of data can be removed/ignored if there is data that you would not like to be captured. For example, a header and/or footer on each page that doesn’t align with the start and end of each record.

Lines to be removed can be matched either as a string comparison or by using a regexp. This also makes up part of the non_tabular_row mapping.

Example:

---
non_tabular_row:
#  start_in_a_record: false
#  end_in_record: false
  capture_start_line: false
  start_line_pattern: !ruby/regexp /\A-{6}\z/
  end_line_pattern: !ruby/regexp /\A\+{6}\z/
  remove_lines:
    header:
    - NYCRIS
    - !ruby/regexp /Page \d+/
    footer:
    - PDF by PDFmaker

The keys (header: and footer: in the above example) can be labelled as anything, but labeling them as something descriptive is advised. Each of them containing an array of lines to be removed.

GOTCHA: Lines will only be removed if every line in a given array is matched in the order they are defined in the mapping.