
DuckDB - Loading MOT test data


The first step was downloading the data and checking that the file contents looked reasonable. The next step is to load them into DuckDB.

Initialise a new DuckDB database to store the data…

duckdb mot_data.duckdb

And now to load each year’s data. Given that different years use different delimiters and filename structures, these files were clearly not generated by a single consistent automated process, so chances are there will be exceptions on import.

I’ll try a quick bulk import to test this; if there are issues it will be easier to go year by year to resolve them, as there are only a limited number of files to deal with.

Bulk load #

$ duckdb mot_data.duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
D CREATE TABLE test_result AS SELECT * FROM read_csv('test_result_20*', delim = '|', header = true);
 41% ▕████████████████████████▌                                   ▏ Conversion Error:
CSV Error on Line: 34549797
Original Line: 233955117|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
Error when converting column "cylinder_capacity". Could not convert string "SILVER" to 'BIGINT'

Column cylinder_capacity is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'cylinder_capacity': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file = test_result_2015.txt
  delimiter = | (Set By User)
  quote = \0 (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Set By User)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  strict_mode = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0


D 

As suspected there are data issues within the csv files. At 41% imported it isn’t clear which files made it in successfully, and rather than restarting this process over and over as further issues are hit, I’ll go year by year since there are only a limited number of files.
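
For reference, the suggestions in the error message above could also be applied to push the bulk load through. A sketch of that route (not the one taken here, since forcing the column to VARCHAR would only mask whatever is wrong on the offending rows)…

-- Sketch only: apply the error message's own suggestions to the bulk load.
-- Forcing cylinder_capacity to VARCHAR and scanning the whole file for type
-- detection would let the import complete, but hides the underlying data issue.
CREATE TABLE test_result AS
SELECT *
FROM read_csv('test_result_20*',
              delim = '|',
              header = true,
              types = {'cylinder_capacity': 'VARCHAR'},
              sample_size = -1);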

2023 #

I’ll load the first csv file using DuckDB’s CREATE TABLE … AS SELECT functionality, where it analyses the csv file and creates appropriately typed columns. I say csv file, as the extension is .csv, but these files are not actually ‘comma separated’: the delimiter is a pipe, so this is passed to the read_csv function.

D CREATE TABLE test_result AS SELECT * FROM read_csv('test_result_2023.csv', delim = '|', header = true);
100% ▕████████████████████████████████████████████████████████████▏ 
D desc test_result;
┌───────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name    │ column_type │  null   │   key   │ default │  extra  │
│      varchar      │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ test_id           │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ vehicle_id        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ test_date         │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ test_class_id     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ test_type         │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ test_result       │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ test_mileage      │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ postcode_area     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ make              │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ model             │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ colour            │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ fuel_type         │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ cylinder_capacity │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ first_use_date    │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
├───────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 14 rows                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────┘
D select count(*) from test_result;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    42216721     │
│ (42.22 million) │
└─────────────────┘

All column types are correctly identified by the import, and we have 42216721 rows, which ties up exactly with the line count of the csv file from the download step.

DuckDB has a handy summarize command that gives an overview of the table…

D summarize table test_result;
100% ▕████████████████████████████████████████████████████████████▏ 
┌───────────────────┬─────────────┬──────────────────────┬────────────┬───────────────┬───┬────────────┬────────────┬────────────┬──────────┬─────────────────┐
│    column_name    │ column_type │         min          │    max     │ approx_unique │ … │    q25     │    q50     │    q75     │  count   │ null_percentage │
│      varchar      │   varchar   │       varchar        │  varchar   │     int64     │   │  varchar   │  varchar   │  varchar   │  int64   │  decimal(9,2)   │
├───────────────────┼─────────────┼──────────────────────┼────────────┼───────────────┼───┼────────────┼────────────┼────────────┼──────────┼─────────────────┤
│ test_id           │ BIGINT      │ 61                   │ 1999999971 │      44923817 │ … │ 500723239  │ 1000190436 │ 1499799503 │ 42216721 │            0.00 │
│ vehicle_id        │ BIGINT      │ 22                   │ 1499999986 │      34933491 │ … │ 375354122  │ 750248245  │ 1124801418 │ 42216721 │            0.00 │
│ test_date         │ DATE        │ 2023-01-01           │ 2023-12-31 │           414 │ … │ 2023-03-25 │ 2023-07-08 │ 2023-10-03 │ 42216721 │            0.00 │
│ test_class_id     │ BIGINT      │ 1                    │ 7          │             6 │ … │ 4          │ 4          │ 4          │ 42216721 │            0.00 │
│ test_type         │ VARCHAR     │ EI                   │ RT         │             4 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ test_result       │ VARCHAR     │ ABA                  │ PRS        │             6 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ test_mileage      │ BIGINT      │ 1                    │ 999999     │        350471 │ … │ 39523      │ 67372      │ 102056     │ 42216721 │            0.77 │
│ postcode_area     │ VARCHAR     │ AB                   │ ZE         │           113 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ make              │ VARCHAR     │ \tGENERAL MOTORS     │ wrightbus  │          9622 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ model             │ VARCHAR     │ \tC63 AMG SPEEDSHI…  │ zr1        │         55499 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ colour            │ VARCHAR     │ BEIGE                │ YELLOW     │            22 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ fuel_type         │ VARCHAR     │ CN                   │ ST         │            16 │ … │ NULL       │ NULL       │ NULL       │ 42216721 │            0.00 │
│ cylinder_capacity │ BIGINT      │ 0                    │ 99999      │          4673 │ … │ 1325       │ 1597       │ 1995       │ 42216721 │            0.66 │
│ first_use_date    │ DATE        │ 0094-05-01           │ 2023-12-28 │         21650 │ … │ 2009-09-15 │ 2014-01-20 │ 2017-04-11 │ 42216721 │            0.00 │
├───────────────────┴─────────────┴──────────────────────┴────────────┴───────────────┴───┴────────────┴────────────┴────────────┴──────────┴─────────────────┤
│ 14 rows                                                                                                                               12 columns (10 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It’s a really quick way to get an overview of a table, and straight away I can see the Make & Model columns will need some tidying up once the data is fully loaded (note the leading tab characters in the min values above).
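
The tidy-up will probably be something along these lines; a rough sketch for later, not run at this point…

-- Sketch of a later cleanup pass (not run yet): strip stray whitespace,
-- including the leading tab characters visible in the summarize output,
-- and normalise the case of the free-text make/model values.
UPDATE test_result
SET make  = upper(trim(make,  ' ' || chr(9))),
    model = upper(trim(model, ' ' || chr(9)));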

On to the next year…

2022 #

Now that the table exists, the COPY command can be used to load further data into it.

D COPY test_result FROM 'test_result_2022.csv' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D

Checking the number of rows imported…

D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘

Again the number of rows ties up with the line count of the csv file.

2021 #

There are multiple files for 2021 and these are comma separated, so no need to specify the delimiter. I’ll also load them all together as they are for the same year.

D COPY test_result FROM 'test_result_2021/test_result_2021_*.csv' (header true);
Invalid Input Error:
CSV Error on Line: 109490
Original Line: 1768675799,951494484,"2021-01-13","4","NT","P","93684","WR","LAND ROVER","SERIES 3 88""","BLUE","PE","2286","1983-03-15"
Value with unterminated quote found.

Possible fixes:
* Disable the parser's strict mode (strict_mode=false) to allow reading rows that do not comply with the CSV standard.
* Enable ignore errors (ignore_errors=true) to skip this row
* Set quote to empty or to a different value (e.g., quote='')

  file = test_result_2021/test_result_2021_32355.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = \ (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Set By User)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  strict_mode = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0


D 

OK, so this file is using a doubled double-quote to escape the double quote in "SERIES 3 88""", and this is throwing off the DuckDB csv importer.

The RFC for the csv ‘format’ (it is very loose as a format), https://www.ietf.org/rfc/rfc4180.txt , does specify doubled double-quotes as the escaping method…

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

Having a dig around, this was identified as an issue with DuckDB, ironically, only last week: https://github.com/duckdb/duckdb/issues/16476 . So for now I will need to find a workaround.

Initially I thought about using sed to convert the doubled double-quotes into backslash-escaped quotes (\"). This is actually quite tricky to do without affecting empty strings ("") and various other edge cases; the sed I came up with is as follows…


sed -E \
  -e 's/(^|,)""(,|$)/\1__EMPTYFIELD__\2/g' \
  -e 's/""/\\"/g' \
  -e 's/__EMPTYFIELD__/""/g' inputfile > outputfile

It changes empty fields into a unique placeholder string (__EMPTYFIELD__), converts the remaining doubled double-quotes into backslash-escaped quotes, and then restores the empty fields.
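
Had I gone down this route, the processed files should then load with the backslash set as the escape character. A sketch only, using the placeholder filename from the sed above; I didn’t actually run this…

-- Sketch only (not run): load the sed-processed output, where quotes inside
-- fields are now backslash-escaped rather than doubled.
COPY test_result FROM 'outputfile' (header true, quote '"', escape '\');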

Then I thought of a simpler approach: if no fields use \ as an escape character, maybe I can just pass " as the escape character to DuckDB. A quick check…

simon@NUC:~/Documents/mot_data$ grep '\\' test_result_2021/test_result_2021_323*.csv
test_result_2021/test_result_2021_32357.csv:412809221,322514882,"2021-09-30","2","NT","P","63333","TA","BMW","R60\3","BLACK","PE","599","1961-11-25"
test_result_2021/test_result_2021_32361.csv:1687172195,106083562,"2021-06-26","2","NT","P","","HA","S\S","SS","MULTI-COLOUR","PE","1605","1998-08-19"
test_result_2021/test_result_2021_32361.csv:1450753305,1147728102,"2021-08-25","2","NT","P","5368","B","KAWASAKI","Z\ZR 1400","BLACK","PE","1400","2007-06-01"
test_result_2021/test_result_2021_32361.csv:771247505,320183670,"2021-10-01","3","NT","P","13550","NR","\URAL WOLF","UNCLASSIFIED","BLACK","PE","745","2003-06-01"
test_result_2021/test_result_2021_32365.csv:1064601483,401021322,"2021-07-13","4","RT","P","111560","NP","MITSUBISHI","PA\JER0","BLUE","DI","","1995-01-01"
test_result_2021/test_result_2021_32367.csv:127450681,107244336,"2021-05-25","2","NT","P","28498","RH","MOTOGUZZ\","NEVADA  750","BEIGE","PE","749","1996-01-01"
test_result_2021/test_result_2021_32378.csv:1799556497,401021322,"2021-07-09","4","NT","F","111560","NP","MITSUBISHI","PA\JER0","BLUE","DI","","1995-01-01"
test_result_2021/test_result_2021_32386.csv:910130147,471105022,"2021-09-29","4","NT","P","124896","EH","MERCEDES-BENZ","MOTORHOME\CAMPER","WHITE","DI","2148","2008-08-07"

Backslash is only used as a normal character in these strings, so it is safe to use " as the escape character…

D COPY test_result FROM 'test_result_2021/test_result_2021_*.csv' (header true, escape '"');
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

A successful load, and the total row counts tie up with the csv file line counts.

2020 #

Multiple files and comma separated again; let’s start by only specifying the header row in the options and see if the import is clean.

D COPY test_result FROM 'dft_test_result-from-2020*.csv' (header true);
Conversion Error:
CSV Error on Line: 1037445
Original Line: 1583831865,1215549991,2020-01-13,4,NT,P,188842,CM,MERCEDES-BENZ,"SPRINTER, AMBULANCE",YELLOW,DI,2987,2016-01-11
Error when converting column "cylinder_capacity". Could not convert string "DI" to 'BIGINT'

Column cylinder_capacity is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'cylinder_capacity': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file = dft_test_result-from-2020-01-01_00-00-00-to-2020-04-01_00-00-00.csv
  delimiter = , (Auto-Detected)
  quote = \0 (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Set By User)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  strict_mode = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0


D 

The strings are only quoted if they contain a comma, and the csv auto-detection clearly didn’t come across such a row in the initial sample it uses to determine the options. Let’s tell it to use the double-quote as the quote character.

D COPY test_result FROM 'dft_test_result-from-2020*.csv' (header true, quote '"');
Invalid Input Error:
CSV Error on Line: 95550
Original Line: 229129217,149149016,2020-01-02,4,RT,P,164033,PL,LAND ROVER,"90\"   HARD/TOP",BLUE,DI,2496,1987-02-12
Value with unterminated quote found.

Possible fixes:
* Disable the parser's strict mode (strict_mode=false) to allow reading rows that do not comply with the CSV standard.
* Enable ignore errors (ignore_errors=true) to skip this row
* Set quote to empty or to a different value (e.g., quote='')

  file = dft_test_result-from-2020-01-01_00-00-00-to-2020-04-01_00-00-00.csv
  delimiter = , (Auto-Detected)
  quote = " (Set By User)
  escape = \0 (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Set By User)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  strict_mode = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0


D COPY test_result FROM 'dft_test_result-from-2020*.csv' (header true, quote '"', escape '\');
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

OK, so specifying the double-quote as the quote character resolved that issue, but another appeared: the escape character was not correctly identified either. Setting it to the backslash allows a clean import, and again the row count matches the line count of the csv files.

2019 #

The file naming for 2019 is the same as 2020, so hopefully the same import command will work…

D COPY test_result FROM 'dft_test_result-from-2019*.csv' (header true, quote '"', escape '\');
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

Perfect, and the row count ties up again.

2018 #

2018 again has the same file naming convention as 2020.

D COPY test_result FROM 'dft_test_result-from-2018*.csv' (header true, quote '"', escape '\');
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

Again a clean import, and the row count matches the file line count.

2017 #

2017 uses a different file naming convention, and from the earlier download step we know these files are pipe delimited.

D COPY test_result FROM 'test_result_2017_*.csv' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2017-01-01                    │     38056161 │
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

Clean import and row count ties up with csv line count.

2016 #

For the rest of the files, the file naming convention and contents match those of 2017.

D COPY test_result FROM 'test_result_2016.txt' (delim '|', header true);
 86% ▕███████████████████████████████████████████████████▌        ▏ Conversion Error:
CSV Error on Line: 35021189
Original Line: 375221213|1464922263|2016-11-24|4|NT|P|56379|PL|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
Error when converting column "cylinder_capacity". Could not convert string "SILVER" to 'BIGINT'

Column cylinder_capacity is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'cylinder_capacity': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file = test_result_2016.txt
  delimiter = | (Set By User)
  quote = \0 (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Set By User)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  strict_mode = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0

D 

I thought this was going too easily! In this case the vehicle model contains the delimiter | within the field. Let’s see how many times this occurs.

simon@NUC:~/Documents/mot_data$ grep ' ||' test_result_20*.txt 
test_result_2015.txt:233955117|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2015.txt:386678619|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2015.txt:1359254837|223200525|2015-11-24|4|NT|F|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
test_result_2015.txt:904428237|223200525|2015-12-14|4|NT|P|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
test_result_2016.txt:375221213|1464922263|2016-11-24|4|NT|P|56379|PL|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2016.txt:869423849|223200525|2016-12-19|4|NT|P|39962|NW|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
simon@NUC:~/Documents/mot_data$ grep '|||' test_result_20*.txt 
test_result_2015.txt:233955117|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2015.txt:386678619|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2015.txt:1359254837|223200525|2015-11-24|4|NT|F|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
test_result_2015.txt:904428237|223200525|2015-12-14|4|NT|P|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
test_result_2016.txt:375221213|1464922263|2016-11-24|4|NT|P|56379|PL|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
test_result_2016.txt:869423849|223200525|2016-12-19|4|NT|P|39962|NW|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
simon@NUC:~/Documents/mot_data$ grep '\\' test_result_20*.txt 
test_result_2005.txt:1730195562|878987390|2005-08-05|0|NT|P|37255|SM|C\\TROEN|UNCLASSIFIED|WHITE|PE||
test_result_2005.txt:1441279620|795826018|2005-08-09|0|NT|P|27241|CO|N\\SSAN|UNCLASSIFIED|BLUE|PE||
test_result_2005.txt:393314624|71595930|2005-08-23|4|NT|F|77988|NE|VOLKSWAGEN|\\T25 TRANSPORTER|YELLOW|PE|1970|1981-09-10
test_result_2005.txt:1111584564|432493652|2005-09-07|0|NT|P|34052|CM|F\\AT|UNCLASSIFIED|BLUE|PE||
test_result_2005.txt:649205540|71595930|2005-09-24|4|RT|P|78640|NE|VOLKSWAGEN|\\T25 TRANSPORTER|YELLOW|PE|1970|1981-09-10
test_result_2005.txt:1206617148|1544270|2005-09-27|0|NT|P|98136|LS|TR\\UMPH|UNCLASSIFIED|PURPLE|PE||
test_result_2005.txt:180093452|108497936|2005-11-21|4|NT|P|17082|AB|MORRIS|10\\4|RED|PE|1292|1935-12-31
test_result_2005.txt:1210685872|262041336|2005-12-05|4|NT|F|106530|M|NISSAN|SUNNY GTI\\R|SILVER|PE|1990|1994-01-01
test_result_2005.txt:869497984|262041336|2005-12-06|4|RT|P|106530|M|NISSAN|SUNNY GTI\\R|SILVER|PE|1990|1994-01-01
test_result_2006.txt:919485480|420164440|2006-01-09|4|NT|P|15011|BS|TRIUMPH|VITESSE\\|BLUE|PE|2000|1971-07-01
test_result_2006.txt:63101176|401021322|2006-01-23|4|NT|P|56343|DY|MITSUBISHI|PA\\JER0|BLUE|DI||1995-01-01
test_result_2006.txt:787669324|1257173890|2006-01-27|4|NT|P|59813|CH|MITSUBISHI|L\\GHT4X4|RED|DI|2470|1994-12-31
test_result_2006.txt:1919893226|546239586|2006-02-14|4|NT|F|11211|CB|NISSAN|FIGA\\RO|BLUE|PE|980|1991-12-31
test_result_2006.txt:1962541712|546239586|2006-02-14|4|RT|P|11211|CB|NISSAN|FIGA\\RO|BLUE|PE|980|1991-12-31
test_result_2006.txt:37193822|307785484|2006-02-17|2|NT|P|39710|PE|CAGIVA|SUPPYC\\TY|BLACK|PE|125|1993-12-31
test_result_2006.txt:997810162|507943506|2006-03-25|1|NT|F|11800|HA|MOTOR HISPANIA|\\|GREEN|PE|49|2001-07-01
...

Finding these rogue || occurrences is difficult as empty fields are also represented as ||, hence searching for ‘ ||’ and ‘|||’ as a way to identify how many there are. There could be others, but looking at the data across all years from 2005 to 2016 it seems to be just these rows. Let’s work with that first.

The Make & Model are entered as free text by the MOT examiner; here perhaps they entered || to signify the second generation of the model.

Checking for backslashes shows they do appear, and the examples from the grep are all escaping the backslash character itself (\\), which means I can’t use | as the escape character in the way the double-quote was used for the 2021 data.

Will need to do a quick sed to change the || to a 2.

simon@NUC:~/Documents/mot_data$ sed 's/|||/2|/g' test_result_2016.txt > test_result_2016_processed.txt
simon@NUC:~/Documents/mot_data$ diff test_result_2016
test_result_2016_processed.txt  test_result_2016.txt            
simon@NUC:~/Documents/mot_data$ diff test_result_2016.txt test_result_2016_processed.txt 
35021189c35021189
< 375221213|1464922263|2016-11-24|4|NT|P|56379|PL|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
---
> 375221213|1464922263|2016-11-24|4|NT|P|56379|PL|TOYOTA|Estima 2|SILVER|PE|3000|2005-10-17
37070120c37070120
< 869423849|223200525|2016-12-19|4|NT|P|39962|NW|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
---
> 869423849|223200525|2016-12-19|4|NT|P|39962|NW|TOYOTA|Estima 2|WHITE|PE|2360|2005-06-17

A quick diff confirms just those lines have been changed and are now correct.

D COPY test_result FROM 'test_result_2016_processed.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2016-01-01                    │     37693380 │
│ 2017-01-01                    │     38056161 │
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

And now we have a clean import, with total rows matching file line count.

2015 #

Will need to repeat the sed for this year.

simon@NUC:~/Documents/mot_data$ sed 's/|||/2|/g' test_result_2015.txt > test_result_2015_processed.txt
simon@NUC:~/Documents/mot_data$ diff test_result_2015.txt test_result_2015_processed.txt 
34549797c34549797
< 233955117|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
---
> 233955117|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima 2|SILVER|PE|3000|2005-10-17
34590574c34590574
< 386678619|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima |||SILVER|PE|3000|2005-10-17
---
> 386678619|1464922263|2015-11-20|4|NT|PRS|49828|UB|TOYOTA|Estima 2|SILVER|PE|3000|2005-10-17
34865010c34865010
< 1359254837|223200525|2015-11-24|4|NT|F|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
---
> 1359254837|223200525|2015-11-24|4|NT|F|54669|UB|TOYOTA|Estima 2|WHITE|PE|2360|2005-06-17
36564216c36564216
< 904428237|223200525|2015-12-14|4|NT|P|54669|UB|TOYOTA|Estima |||WHITE|PE|2360|2005-06-17
---
> 904428237|223200525|2015-12-14|4|NT|P|54669|UB|TOYOTA|Estima 2|WHITE|PE|2360|2005-06-17
simon@NUC:~/Documents/mot_data$ 

All looks good with the diff.

D COPY test_result FROM 'test_result_2015_processed.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2015-01-01                    │     37490736 │
│ 2016-01-01                    │     37693380 │
│ 2017-01-01                    │     38056161 │
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
└───────────────────────────────┴──────────────┘
D 

Clean import and again row count ties up with line count in file.

2014 #

No || is used in the remaining years, so hopefully these will be clean imports.

D COPY test_result FROM 'test_result_2014.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2014-01-01                    │     37493825 │
│ 2015-01-01                    │     37490736 │
│ 2016-01-01                    │     37693380 │
│ 2017-01-01                    │     38056161 │
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
├───────────────────────────────┴──────────────┤
│ 10 rows                            2 columns │
└──────────────────────────────────────────────┘
D

Successful import with no issues.

2013 #

D COPY test_result FROM 'test_result_2013.txt' (delim '|', header true);
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2013-01-01                    │     37361925 │
│ 2014-01-01                    │     37493825 │
│ 2015-01-01                    │     37490736 │
│ 2016-01-01                    │     37693380 │
│ 2017-01-01                    │     38056161 │
│ 2018-01-01                    │     38681801 │
│ 2019-01-01                    │     39310698 │
│ 2020-01-01                    │     38594013 │
│ 2021-01-01                    │     40380646 │
│ 2022-01-01                    │     41632878 │
│ 2023-01-01                    │     42216721 │
├───────────────────────────────┴──────────────┤
│ 11 rows                            2 columns │
└──────────────────────────────────────────────┘
D

Further successful import.

2012 - 2005 #

These last few all import cleanly and the data validates with the file row counts.

D COPY test_result FROM 'test_result_2012.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏
D COPY test_result FROM 'test_result_2011.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2010.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2009.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2008.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2007.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2006.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D COPY test_result FROM 'test_result_2005.txt' (delim '|', header true);
100% ▕████████████████████████████████████████████████████████████▏ 
D select date_trunc('year',test_date), count(*) from test_result group by date_trunc('year',test_date);
┌───────────────────────────────┬──────────────┐
│ date_trunc('year', test_date) │ count_star() │
│             date              │    int64     │
├───────────────────────────────┼──────────────┤
│ 2015-01-01                    │     37490736 │
│ 2013-01-01                    │     37361925 │
│ 2006-01-01                    │     32014080 │
│ 2019-01-01                    │     39310698 │
│ 2010-01-01                    │     36134920 │
│ 2018-01-01                    │     38681801 │
│ 2014-01-01                    │     37493825 │
│ 2012-01-01                    │     36846342 │
│ 2021-01-01                    │     40380646 │
│ 2008-01-01                    │     34439132 │
│ 2022-01-01                    │     41632878 │
│ 2020-01-01                    │     38594013 │
│ 2017-01-01                    │     38056161 │
│ 2009-01-01                    │     35436943 │
│ 2005-01-01                    │      7499744 │
│ 2023-01-01                    │     42216721 │
│ 2016-01-01                    │     37693380 │
│ 2007-01-01                    │     33591238 │
│ 2011-01-01                    │     36849154 │
├───────────────────────────────┴──────────────┤
│ 19 rows                            2 columns │
└──────────────────────────────────────────────┘
D select count(*) from test_result;
┌──────────────────┐
│   count_star()   │
│      int64       │
├──────────────────┤
│    681724337     │
│ (681.72 million) │
└──────────────────┘
D 

Lookup Tables #

A set of lookup tables is provided for several of the fields in the test data. These load quickly with no issues.

D CREATE TABLE item_detail AS SELECT * FROM read_csv('lookup/item_detail.csv', delim = '|', header = true);
D CREATE TABLE item_group AS SELECT * FROM read_csv('lookup/item_group.csv', delim = '|', header = true);
D CREATE TABLE mdr_fuel_types AS SELECT * FROM read_csv('lookup/mdr_fuel_types.csv', delim = '|', header = true);
D CREATE TABLE mdr_rfr_locations AS SELECT * FROM read_csv('lookup/mdr_rfr_location.csv', delim = '|', header = true);
D CREATE TABLE mdr_test_outcome AS SELECT * FROM read_csv('lookup/mdr_test_outcome.csv', delim = '|', header = true);
D CREATE TABLE mdr_test_type AS SELECT * FROM read_csv('lookup/mdr_test_type.csv', delim = '|', header = true);
D 
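
As a quick check that the lookups line up with the main table, something like the following join should decode the fuel_type codes. The lookup column names here are assumptions rather than checked against the files, so treat this as a sketch…

-- Sketch: decode the fuel_type codes in test_result via the lookup table.
-- The lookup column names (type_code, fuel_type_name) are assumptions here;
-- check the real ones first with: DESCRIBE mdr_fuel_types;
SELECT f.fuel_type_name,
       count(*) AS tests
FROM test_result t
JOIN mdr_fuel_types f ON t.fuel_type = f.type_code
GROUP BY f.fuel_type_name
ORDER BY tests DESC;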

Next Step #

Now that we have a dataset of over half a billion rows, it’s time to start some analysis and see how DuckDB performs at this scale of data.
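
As a small taste of that, something like a pass rate per year falls out of a single aggregate. A sketch only, assuming ‘P’ marks a pass (the mdr_test_outcome lookup loaded above will confirm the actual codes)…

-- Sketch of a first analysis query: pass rate by year.
-- Assumes 'P' marks a pass; verify against the mdr_test_outcome lookup.
SELECT date_trunc('year', test_date) AS test_year,
       count(*) AS tests,
       round(100.0 * count(*) FILTER (WHERE test_result = 'P') / count(*), 1) AS pass_rate_pct
FROM test_result
GROUP BY test_year
ORDER BY test_year;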