Flat Files

Flat files regroup all files that lies under your local filesystem. This connector can read and/or write (see writing) to such files. Currently, CSV files and XLSX files are supported.

Parameters on the datastore element are the following :

Parameters

Details

name

how you want this data store be refered as

type

should be file

location

root location for all tables of this datastore. This parameter is optionnal

CSV files

A CSV tables has the following parameters

Parameters

Details

name

how you want this table be refered as

type

should be csv

location

location of the file (or files, see pattern reference)

locationSeparator

separator for location (see pattern reference)

locale

encoding to use for the file. By default it is “UTF-8” (more information here)

compression

compression format to use. Can be none (default, no compression) or gz (for gzip compression)

delimiter

separator for cells (default tabulation, “t”)

header

is there an header (first row) : true (default) or false

newline

character to use to specify a new line (default “n”)

quote

character to use that quote cells (default “, specified by “”“)

quoteEscape

character to use before a quote character inside a cell that isn’t a quote (default ” specified by “”“)

comment

character which at the begining of a line indicate that it should be skipped

File location patterns

Each CSV table can be composed of many CSV files by using the “*” character in the location. “*” means any value

For instance the following description will import any file which is in the pattern /data/year=*/month=*/file-*.csv

<datastore name="flat" type="flat" location="/data/">
  <!-- File that will be used as source -->
  <table name="src" type="csv" location="year=*/month=*/file-*.csv">
    <column name="col1" type="text"/>
  </table>
</datastore>

Therefore :

  • /data/year=2019/month=11/file-20191101.csv : will be taken

  • /data/year=2019/month=11/day=01/file-20191101.csv : will be NOT be taken (one directory)

  • /data/year=2019/month=11/notafile.csv : will be NOT be taken (not a good file name)

Additionnaly, the location attribute at the table level can store multiple locations. Each location is separated by the pipe character (‘|’) but you can change the separator using the locationSeparator attribute.

You can mix multiple locations and the patterns.

For instance, the following example will read every csv files in the repertory /data/a/ and all csv files in the repertory /data/b/ (but not in the repertory /data/c/ for instance).

<datastore name="flat" type="flat" location="/data/">
  <!-- File that will be used as source -->
  <table name="src" type="csv" location="a/*.csv|b/*.csv">
    <column name="col1" type="text"/>
  </table>
</datastore>

Writing to CSV files

You can write to a CSV file by providing a source element to the table that describe how it must be filled.

When a CSV table is used for ouput only, it is not needed to supply the columns structure. If absent, the structure of the feeding stream will be used.

<datastore name="flat" type="file" location="/data/">
  <!-- File that will be used as source -->
  <table name="src" type="csv" location="input.csv">
    <column name="col1" type="text"/>
  </table>

  <!-- File that will contain a copy of the src table -->
  <table name="dest1" type="csv" location="output.csv">
    <!-- If the table is in a module, just change datastore by module -->
    <source type="datastore" datastore="flat" table="src"/>
  </table>

  <!-- File that will contain the output of a SQL query -->
  <table name="dest2" type="csv" location="output-sql.csv">
    <!-- queries only work on a SQL datastore -->
    <source type="datastoreQuery" datastore="dw">
      select * from my_table where col1 = 'A'
    </source>
  </table>
</datastore>

To process the datastore (all tables with a source element), you can use the following command :

ipa run-datastore flat

It will produce the output.csv and output-sql.csv files.

Excel files (XLSX only)

An Excel tables has the following parameters. You can read or write from an Excel file.

Parameters

Details

name

how you want this table be refered as

type

should be xlsx

location

location of the file (or files, works like CSV files, see pattern reference)

locationSeparator

separator for location (works like CSV files, see pattern reference)

sheet

Which sheet in the file should be used

colStart

Column index of the first column (start with A)

rowStart

Row number where data should be read/written (start at 1, header should be excluded)

Currently, there is also metadata discovery, meaning you need to specify the column names and types you want to read from the Excel file. If a header is present in the file, it shouldn’t be used (rowStart should be after the header row).

On Windows, you can’t write to an Excel file which is already open in Excel.

The following example shows a datastore named excel that acces files in the directory data. Two tables are defined

<datastore name="excel" type="file" location="./data/">
  <!-- File that will be used as source -->
  <table name="src" type="xlsx" location="input.xlsx"
    sheet="input_sheet" rowStart="2" colStart="A">
    <column name="col1" type="text"/>
    <column name="col1" type="datetime"/>
  </table>
  <!-- File that will be used as output (no need to provide columns/metadata) -->
  <table name="src" type="xlsx" location="input.xlsx"
    sheet="output_sheet" rowStart="2" colStart="A">
    <source type="datastore" datastore="excel" table="src"/>
  </table>
</datastore>