This is a substitute for a live lecture. It draws on the British road safety data from 2024-11-15_22.

Read this file in its rendered (html) format and try to think yourself how you would proceed. You can sketch it by hand on paper, if you don’t trust yourself to code it all. The most important thing here is to understand why and how to manipulate the tables.

The context

The British government published statistics of road accidents. We investigated a table where each row represented a casualty; that is, a person who died or was severely injured in an accident. Each accident had a unique ID. Each person only had a unique ID within that given accident. The table contained many other columns with details about the person (driver, passenger or pedestrian, age, sex, etc.), as well as what type of vehicle the person was on.

We extracted the worst accident by grouping the table by accident IDs and counting the rows (ie. casualties). The worst accident resulted in 70 casualties and happened in 2023. It looked like a crash of a bus/coach with a smaller vehicle.

In this session, we are going to join the casualties data with another table, which lists the accidents. It contains details about the vehicle types and some circumstances recorded by the police (e.g. weather, location, road type).

You possibly remember that we had to decode the values of categorical variables such as sex (1 represented male, 2 represented female, 9 represented person did not want to tell, and -1 represented impossible to retrieve or similar). You will get the casualties data already decoded, but you will do it with the accident data.

This exercise is a super-practical demonstration of dplyr::..._join and tidyr::pivot_... in one workflow! Enjoy!

Libraries

library(readr, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(tidyr, warn.conflicts = FALSE, quietly = TRUE)

The worst British accident in the past five years

This is what we found out previously:

worst_accident <- read_tsv("all_casualties_labeled.tsv", 
                           show_col_types = FALSE) %>% 
  group_by(accident_reference) %>% 
  add_count() %>%
  ungroup() %>%
  slice_max(order_by = n) %>%
  select(-n)
knitr::kable(worst_accident)
ID rowid accident_reference accident_year age_of_casualty vehicle_reference casualty_class casualty_type sex_of_casualty casualty_severity
rc_2023 647399 520300610 2023 39 1 Driver or rider Motorcycle over 125cc and up to 500cc rider or passenger Male Slight
rc_2023 647400 520300610 2023 58 2 Driver or rider Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647401 520300610 2023 55 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647402 520300610 2023 45 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647403 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647404 520300610 2023 28 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647405 520300610 2023 40 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647406 520300610 2023 22 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647407 520300610 2023 42 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647408 520300610 2023 25 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647409 520300610 2023 39 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647410 520300610 2023 45 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647411 520300610 2023 57 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647412 520300610 2023 46 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647413 520300610 2023 44 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647414 520300610 2023 43 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647415 520300610 2023 34 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647416 520300610 2023 29 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647417 520300610 2023 28 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647418 520300610 2023 41 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647419 520300610 2023 63 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647420 520300610 2023 54 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647421 520300610 2023 39 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647422 520300610 2023 42 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647423 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647424 520300610 2023 64 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647425 520300610 2023 27 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647426 520300610 2023 38 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647427 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647428 520300610 2023 52 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647429 520300610 2023 25 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647430 520300610 2023 27 2 Passenger Bus or coach occupant (17 or more pass seats) Female Slight
rc_2023 647431 520300610 2023 54 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647432 520300610 2023 30 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647433 520300610 2023 40 2 Passenger Bus or coach occupant (17 or more pass seats) Female Serious
rc_2023 647434 520300610 2023 45 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647435 520300610 2023 42 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647436 520300610 2023 50 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647437 520300610 2023 35 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647438 520300610 2023 35 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647439 520300610 2023 30 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647440 520300610 2023 41 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647441 520300610 2023 50 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647442 520300610 2023 27 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647443 520300610 2023 43 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647444 520300610 2023 34 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647445 520300610 2023 39 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647446 520300610 2023 41 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647447 520300610 2023 30 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647448 520300610 2023 25 2 Passenger Bus or coach occupant (17 or more pass seats) Female Serious
rc_2023 647449 520300610 2023 39 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647450 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647451 520300610 2023 34 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647452 520300610 2023 36 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647453 520300610 2023 52 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647454 520300610 2023 55 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647455 520300610 2023 61 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647456 520300610 2023 23 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647457 520300610 2023 35 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647458 520300610 2023 36 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647459 520300610 2023 49 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647460 520300610 2023 25 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647461 520300610 2023 61 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647462 520300610 2023 27 2 Passenger Bus or coach occupant (17 or more pass seats) Male Serious
rc_2023 647463 520300610 2023 24 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647464 520300610 2023 34 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647465 520300610 2023 20 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647466 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647467 520300610 2023 53 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight
rc_2023 647468 520300610 2023 31 2 Passenger Bus or coach occupant (17 or more pass seats) Male Slight

Try and make sense of the data by grouping it by vehicle_reference, casualty_class, casualty_type, casualty_severity, and accident_year. Of course they are all going to have the same accident_year, but you will not lose the information which year it is.

worst_accident %>%
  group_by(vehicle_reference, 
           casualty_class, 
           casualty_type, 
           casualty_severity, 
           accident_year) %>% 
  count() %>%
  knitr::kable()
vehicle_reference casualty_class casualty_type casualty_severity accident_year n
1 Driver or rider Motorcycle over 125cc and up to 500cc rider or passenger Slight 2023 1
2 Driver or rider Bus or coach occupant (17 or more pass seats) Slight 2023 1
2 Passenger Bus or coach occupant (17 or more pass seats) Serious 2023 18
2 Passenger Bus or coach occupant (17 or more pass seats) Slight 2023 50

It looks as if a motorbike crashed with a bus/coach. Fortunately, nobody died in this accident.

Decoded and filtered Collisions table

The file all_about_the_accident.csv contains the relevant data from the 2023 Collision, so read it in and use the new information to describe the accident in more detail.

all_about <- readr::read_csv("all_about_the_accident.csv", show_col_types = FALSE) 
  knitr::kable( all_about ) # knitr::kable just renders the table prettier
rowid vehicle_reference vehicle_type towing_and_articulation vehicle_manoeuvre vehicle_direction_from vehicle_direction_to vehicle_location_restricted_lane junction_location skidding_and_overturning hit_object_in_carriageway vehicle_leaving_carriageway hit_object_off_carriageway first_point_of_impact vehicle_left_hand_drive journey_purpose_of_driver sex_of_driver age_of_driver age_band_of_driver engine_capacity_cc propulsion_code age_of_vehicle generic_make_model driver_imd_decile driver_home_area_type lsoa_of_driver escooter_flag driver_distance_banding
1 1 Motorcycle over 125cc and up to 500cc No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded None Did not leave carriageway None Nearside No Not known Male 39 36 - 45 955 Petrol 21 Data missing or out of range More deprived 20-30% Urban area E01029291 Vehicle was not an e-scooter Collision occurred between 10.001 and 20km of drivers home postcode
2 2 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded and overturned Previous accident Did not leave carriageway None Front No Journey as part of work Male 58 56 - 65 5132 Heavy oil 4 WRIGHTBUS STREETDECK More deprived 20-30% Urban area E01014807 Vehicle was not an e-scooter Collision occurred between 20.001 and 100km of drivers home postcode
3 3 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded Previous accident Did not leave carriageway None Front No Journey as part of work Male 54 46 - 55 10837 Heavy oil 5 Data missing or out of range Less deprived 30-40% Small town E01029267 Vehicle was not an e-scooter Collision occurred between 10.001 and 20km of drivers home postcode

Clue in the code snippet below.

# We had to revise the number of vehicles involved, hadn't we?
# It was a motorbike and TWO busses. We couldn't know about the third one from the casualties data because Vehicle 3  contributed with no casualties. And we did not suspect there was a third like we would have had if there were casualties on the third vehicle but none on the second or first. 
# What happened: the motorbike driver skidded (slid) on the surface when bending left on a c-road. It was not a junction. In the same direction followed a bus (Vehicle 2). They probably crashed so that the biker landed in front of the bus, slightly turned orthogonally, so that the bus bumped into its front side with its own front. Or the bike driver fell off his motorbike. The driver was probably doing his best to avoid overrunning the driver and probably hit just the motorbike (hit_object_in_carriageway - driver has none, this looks like falling off, no? The bus has Previous accident, that will be the motorbike ). Anyway, the bus also skidded and even overturned. Another bus (Vehicle 3) finally hit Vehicle 2 or Vehicle 1. All casualties except the biker were sitting on Vehicle 2, that is, the earlier bus that overturned.   
# The motorbike was rather old, the buses rather new. None of the drivers was extremely young or extremely old. We don't know whether anyone was to blame. 

Now, we could of course learn more about the casualties just by eyeballing, but how about adding the casualty severity of drivers to this table?

First, make a drivers table by filtering just drivers from the worst_accident table and select the columns casualty_class, casualty_severity, accident_reference, and vehicle_reference.

drivers <- worst_accident %>% 
  filter(casualty_class == "Driver or rider") %>%
  select(casualty_class, casualty_severity, accident_reference, vehicle_reference, )
drivers %>% knitr::kable()
casualty_class casualty_severity accident_reference vehicle_reference
Driver or rider Slight 520300610 1
Driver or rider Slight 520300610 2

Join the drivers and all_about tables so that it contains all vehicles. You can just print it out without saving in a variable.

drivers %>% full_join(all_about, by = "vehicle_reference") %>%
  knitr::kable()
casualty_class casualty_severity accident_reference vehicle_reference rowid vehicle_type towing_and_articulation vehicle_manoeuvre vehicle_direction_from vehicle_direction_to vehicle_location_restricted_lane junction_location skidding_and_overturning hit_object_in_carriageway vehicle_leaving_carriageway hit_object_off_carriageway first_point_of_impact vehicle_left_hand_drive journey_purpose_of_driver sex_of_driver age_of_driver age_band_of_driver engine_capacity_cc propulsion_code age_of_vehicle generic_make_model driver_imd_decile driver_home_area_type lsoa_of_driver escooter_flag driver_distance_banding
Driver or rider Slight 520300610 1 1 Motorcycle over 125cc and up to 500cc No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded None Did not leave carriageway None Nearside No Not known Male 39 36 - 45 955 Petrol 21 Data missing or out of range More deprived 20-30% Urban area E01029291 Vehicle was not an e-scooter Collision occurred between 10.001 and 20km of drivers home postcode
Driver or rider Slight 520300610 2 2 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded and overturned Previous accident Did not leave carriageway None Front No Journey as part of work Male 58 56 - 65 5132 Heavy oil 4 WRIGHTBUS STREETDECK More deprived 20-30% Urban area E01014807 Vehicle was not an e-scooter Collision occurred between 20.001 and 100km of drivers home postcode
NA NA NA 3 3 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded Previous accident Did not leave carriageway None Front No Journey as part of work Male 54 46 - 55 10837 Heavy oil 5 Data missing or out of range Less deprived 30-40% Small town E01029267 Vehicle was not an e-scooter Collision occurred between 10.001 and 20km of drivers home postcode

Decoding of the encoded values with the metadata file guide.xlsx

Now we will concentrate on how to obtain the convenient all_about_the_accident.csv from the data that the British government has actually published.

To load the original data, read in collisions_2023.csv and save it in a variable called collisions.

collisions <- read_csv("collisions_2023.csv")
## Rows: 104258 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): accident_index, accident_reference, location_easting_osgr, locati...
## dbl  (26): accident_year, police_force, accident_severity, number_of_vehicle...
## time  (1): time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now glimpse it.

glimpse(collisions)
## Rows: 104,258
## Columns: 37
## $ accident_index                              <chr> "2023010419171", "20230104…
## $ accident_year                               <dbl> 2023, 2023, 2023, 2023, 20…
## $ accident_reference                          <chr> "010419171", "010419183", …
## $ location_easting_osgr                       <chr> "525060", "535463", "50870…
## $ location_northing_osgr                      <chr> "170416", "198745", "17769…
## $ longitude                                   <chr> "-0.202878", "-0.042464", …
## $ latitude                                    <chr> "51.418974", "51.671155", …
## $ police_force                                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ accident_severity                           <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ number_of_vehicles                          <dbl> 1, 3, 2, 2, 2, 1, 2, 1, 1,…
## $ number_of_casualties                        <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ date                                        <chr> "01/01/2023", "01/01/2023"…
## $ day_of_week                                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ time                                        <time> 01:24:00, 02:25:00, 03:50…
## $ local_authority_district                    <dbl> -1, -1, -1, -1, -1, -1, -1…
## $ local_authority_ons_district                <chr> "E09000024", "E09000010", …
## $ local_authority_highway                     <chr> "E09000024", "E09000010", …
## $ first_road_class                            <dbl> 5, 6, 3, 3, 3, 3, 3, 3, 3,…
## $ first_road_number                           <dbl> 0, 0, 437, 5, 3220, 5, 3, …
## $ road_type                                   <dbl> 2, 6, 1, 6, 6, 6, 6, 3, 6,…
## $ speed_limit                                 <dbl> 20, 30, 30, 30, 30, 30, 20…
## $ junction_detail                             <dbl> 9, 3, 1, 3, 8, 0, 6, 0, 0,…
## $ junction_control                            <dbl> 4, 4, 4, 4, 4, -1, 2, -1, …
## $ second_road_class                           <dbl> 5, 6, 6, 6, 6, 0, 3, 0, 0,…
## $ second_road_number                          <dbl> 0, 0, 0, 0, 0, -1, 3202, -…
## $ pedestrian_crossing_human_control           <dbl> 2, 0, 0, 0, 0, 2, 0, 0, 0,…
## $ pedestrian_crossing_physical_facilities     <dbl> 5, 1, 0, 0, 0, 5, 0, 0, 4,…
## $ light_conditions                            <dbl> 4, 4, 4, 4, 4, 4, 4, 1, 4,…
## $ weather_conditions                          <dbl> 8, 1, 1, 9, 1, 1, 1, 1, 1,…
## $ road_surface_conditions                     <dbl> 2, 1, 1, 1, 1, 1, 2, 1, 1,…
## $ special_conditions_at_site                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ carriageway_hazards                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ urban_or_rural_area                         <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ did_police_officer_attend_scene_of_accident <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ trunk_road_flag                             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ lsoa_of_accident_location                   <chr> "E01003383", "E01001547", …
## $ enhanced_severity_collision                 <dbl> -1, -1, -1, -1, -1, -1, -1…

Values of obviously categorical variables are encoded with digits just like we saw it in the original casualties tables. Use the readxl library to read in the first worksheet of the excel file guide.xlsx and call the variable guide. And glimpse it. Or better, just open the xlsx file in Excel and look at it with all convenience.

library(readxl)
## Warning: package 'readxl' was built under R version 4.4.2
guide <- read_xlsx("guide.xlsx")
glimpse(guide)
## Rows: 1,785
## Columns: 5
## $ table         <chr> "accident", "accident", "accident", "accident", "acciden…
## $ `field name`  <chr> "collision_index", "collision_year", "collision_referenc…
## $ `code/format` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "1", "3", "4", "…
## $ label         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Metropolitan Po…
## $ note          <chr> "unique value for each accident. The accident_index comb…

Obviously, the values in the column field name at least partly correspond to names of columns in the collisions data frame. The values in the column code/format correspond to values actually spotted in the collisions data frame.

If we transformed the collisions data frame to a long format with a column of column names and a column of code/format values of these column names, and if we kept the label column, we could join collisions with guide on the columns field name and code/format!!

Now, which columns should occur in the field name column? If we pick all, the long format could be really long (the current number of rows times the number of columns times the number of their observed values!), so it makes some sense to make a sensible selection. Below are the column names again, and you are welcome to read the note column in the Excel spreadsheet (in Excel) to make up your mind.

To be super sure that we don’t get very long data, we extract just the rows of collisions that represent the accident we focus on and save this selection in a variable called collisions_worst. First save the value of accident_reference from worst_accident as one number in a variable called accident_index_worst and then use this variable for filtering.

accident_index_worst <- worst_accident %>% 
  distinct(accident_reference) %>% 
  pull()
collisions_worst <- collisions %>% 
  filter(accident_reference == accident_index_worst)

Now glimpse collisions_worst

glimpse(collisions_worst)
## Rows: 1
## Columns: 37
## $ accident_index                              <chr> "2023520300610"
## $ accident_year                               <dbl> 2023
## $ accident_reference                          <chr> "520300610"
## $ location_easting_osgr                       <chr> "326970"
## $ location_northing_osgr                      <chr> "137916"
## $ longitude                                   <chr> "-3.045176"
## $ latitude                                    <chr> "51.135841"
## $ police_force                                <dbl> 52
## $ accident_severity                           <dbl> 2
## $ number_of_vehicles                          <dbl> 3
## $ number_of_casualties                        <dbl> 70
## $ date                                        <chr> "17/01/2023"
## $ day_of_week                                 <dbl> 3
## $ time                                        <time> 06:00:00
## $ local_authority_district                    <dbl> -1
## $ local_authority_ons_district                <chr> "E06000066"
## $ local_authority_highway                     <chr> "E06000066"
## $ first_road_class                            <dbl> 3
## $ first_road_number                           <dbl> 39
## $ road_type                                   <dbl> 6
## $ speed_limit                                 <dbl> 40
## $ junction_detail                             <dbl> 0
## $ junction_control                            <dbl> -1
## $ second_road_class                           <dbl> 0
## $ second_road_number                          <dbl> -1
## $ pedestrian_crossing_human_control           <dbl> 0
## $ pedestrian_crossing_physical_facilities     <dbl> 0
## $ light_conditions                            <dbl> 6
## $ weather_conditions                          <dbl> 1
## $ road_surface_conditions                     <dbl> 4
## $ special_conditions_at_site                  <dbl> 0
## $ carriageway_hazards                         <dbl> 3
## $ urban_or_rural_area                         <dbl> 2
## $ did_police_officer_attend_scene_of_accident <dbl> 1
## $ trunk_road_flag                             <dbl> 2
## $ lsoa_of_accident_location                   <chr> "E01032632"
## $ enhanced_severity_collision                 <dbl> -1

A reasonable choice is stored here in the pickthese variable as a vector of column indices.

pickthese <- c(8, 9, 13, 15, 16, 17, 18, 20,22, 23, 24,26:34, 37)

So, let’s create the long table from collisions_worst! If you do this on your own first, you will probably get this error:

Error in `pivot_longer()`:
! Can't combine `police_force` <double> and `local_authority_ons_district` <character>.

This means that this function cannot coerce all the values to character itself but needs you to do do that beforehand. Therefore, you must convert all these columns to character vectors before calling pivot_longer.

long <- collisions_worst %>% 
  mutate(across(all_of(pickthese), ~ as.character(.x))) %>%
  pivot_longer(
    cols = all_of(pickthese), # use all_of() if you store the selection in a variable you have defined before. It would work without but you will get a warning that it is deprecated now.  
    names_to = "field name", 
    values_to = "code/format"
               )

Now you can join the long data frame with the guide data frame, but not to clutter the table with useless columns from guide, let us first select from it only field_name, code/format, and label (and save this under the same variable name).

guide <- guide %>% select(c("field name", "code/format", "label"))

Let’s call the joined data frame joined.

joined <- long %>% left_join(guide, by = c("field name", "code/format"))

We do not need the code/format column any more, so we unselect it. Now we would like to transform the field name column into column names put the values from the corresponding label column into their values. The whole data frame should then have again only one row and some more columns.

You will need the pivot_wider function. If you define the two main arguments, namely names_from and values_from, you will not be entirely happy, because you will get a few more rows than the expected one, and their values will contain many NA. You need to tell the function that the combination of columns that were not part of the long transformation belong all to one observation. In other words, you need an ID column. And you already have it: the accident_reference, or even year - any column will do that contains the same value throughout.

This would also have worked if you had worked with the entire collisions data. You would have obtained the original number of rows as well. This looks scary at first, but the explanation and solution are simple!

Let’s call this new data frame wide.

wide <- joined %>% 
  pivot_wider(names_from = "field name", 
              values_from = "label", 
              id_cols = "accident_index"
              )

Vehicles

To drill down to the individual vehicles, we actually also need to look at the vehicles_2023.csv file.

First create a data frame called vehicles_worst that will only contain vehicles involved in the worst accident (use the value stored in the accident_index_worst variable)

vehicles_worst <- read_csv("vehicles_2023.csv", show_col_types = FALSE) %>%
  filter(accident_reference == accident_index_worst)
vehicles_worst %>% glimpse()
## Rows: 3
## Columns: 34
## $ accident_index                   <chr> "2023520300610", "2023520300610", "20…
## $ accident_year                    <dbl> 2023, 2023, 2023
## $ accident_reference               <chr> "520300610", "520300610", "520300610"
## $ vehicle_reference                <dbl> 1, 2, 3
## $ vehicle_type                     <dbl> 4, 11, 11
## $ towing_and_articulation          <dbl> 0, 0, 0
## $ vehicle_manoeuvre                <dbl> 16, 16, 16
## $ vehicle_direction_from           <dbl> 4, 4, 4
## $ vehicle_direction_to             <dbl> 1, 1, 1
## $ vehicle_location_restricted_lane <dbl> 0, 0, 0
## $ junction_location                <dbl> 0, 0, 0
## $ skidding_and_overturning         <dbl> 1, 2, 1
## $ hit_object_in_carriageway        <dbl> 0, 1, 1
## $ vehicle_leaving_carriageway      <dbl> 0, 0, 0
## $ hit_object_off_carriageway       <dbl> 0, 0, 0
## $ first_point_of_impact            <dbl> 4, 1, 1
## $ vehicle_left_hand_drive          <dbl> 1, 1, 1
## $ journey_purpose_of_driver        <dbl> 6, 1, 1
## $ sex_of_driver                    <dbl> 1, 1, 1
## $ age_of_driver                    <dbl> 39, 58, 54
## $ age_band_of_driver               <dbl> 7, 9, 8
## $ engine_capacity_cc               <dbl> 955, 5132, 10837
## $ propulsion_code                  <dbl> 1, 2, 2
## $ age_of_vehicle                   <dbl> 21, 4, 5
## $ generic_make_model               <chr> "-1", "WRIGHTBUS STREETDECK", "-1"
## $ driver_imd_decile                <dbl> 3, 3, 7
## $ driver_home_area_type            <dbl> 1, 1, 2
## $ lsoa_of_driver                   <chr> "E01029291", "E01014807", "E01029267"
## $ escooter_flag                    <dbl> 0, 0, 0
## $ dir_from_e                       <chr> "NULL", "NULL", "NULL"
## $ dir_from_n                       <chr> "NULL", "NULL", "NULL"
## $ dir_to_e                         <chr> "NULL", "NULL", "NULL"
## $ dir_to_n                         <chr> "NULL", "NULL", "NULL"
## $ driver_distance_banding          <dbl> 3, 4, 3

Decode the values of closed-listed categorical variables again. First pick columns to pour together with pivot_longer. Call the selection pickthese2. These is the vector of all column names:

colnames(vehicles_worst)
##  [1] "accident_index"                   "accident_year"                   
##  [3] "accident_reference"               "vehicle_reference"               
##  [5] "vehicle_type"                     "towing_and_articulation"         
##  [7] "vehicle_manoeuvre"                "vehicle_direction_from"          
##  [9] "vehicle_direction_to"             "vehicle_location_restricted_lane"
## [11] "junction_location"                "skidding_and_overturning"        
## [13] "hit_object_in_carriageway"        "vehicle_leaving_carriageway"     
## [15] "hit_object_off_carriageway"       "first_point_of_impact"           
## [17] "vehicle_left_hand_drive"          "journey_purpose_of_driver"       
## [19] "sex_of_driver"                    "age_of_driver"                   
## [21] "age_band_of_driver"               "engine_capacity_cc"              
## [23] "propulsion_code"                  "age_of_vehicle"                  
## [25] "generic_make_model"               "driver_imd_decile"               
## [27] "driver_home_area_type"            "lsoa_of_driver"                  
## [29] "escooter_flag"                    "dir_from_e"                      
## [31] "dir_from_n"                       "dir_to_e"                        
## [33] "dir_to_n"                         "driver_distance_banding"
pickthese2 <- c(5,6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 23, 25, 27,29)

Create the vehicles_long data frame. Don’t forget to mutate all the pivoted columns to character vectors and use the helper function all_of with the cols argument to prevent the warning.

vehicles_long <- vehicles_worst %>% 
  mutate(across(all_of(pickthese2), ~ as.character(.x))) %>%
  pivot_longer(
    cols = all_of(pickthese2),
    names_to = "field name", 
    values_to = "code/format"
               )

Join the vehicles_long with the guide again and finally deselect the code/format column. Call the result joined_vehicles.

joined_vehicles <- vehicles_long %>%
  left_join(guide, by = c("field name", "code/format")) %>%
  select(!c("code/format"))

Finally you want to make the table wide again. Note that this time you must tell tidyr that you want a row for each vehicle, so use the vehicle_reference, accident_reference, and accident_year as a value of the id_cols argument.

Technically, you would only need the vehicle reference to obtain one row per vehicle, but the function simply trashes columns that are neither mentioned in this argument nor among the names_from and values_from vectors! So you would lose the accident reference and accident year. Call the result wide_vehicles.

wide_vehicles <- joined_vehicles %>% 
  pivot_wider(id_cols = c("vehicle_reference", 
                          "accident_reference", 
                          "accident_year"),
              names_from = "field name", 
              values_from = "label" 
              
              )
glimpse(wide_vehicles)
## Rows: 3
## Columns: 22
## $ vehicle_reference                <dbl> 1, 2, 3
## $ accident_reference               <chr> "520300610", "520300610", "520300610"
## $ accident_year                    <dbl> 2023, 2023, 2023
## $ vehicle_type                     <chr> "Motorcycle over 125cc and up to 500c…
## $ towing_and_articulation          <chr> "No tow/articulation", "No tow/articu…
## $ vehicle_manoeuvre                <chr> "Going ahead left-hand bend", "Going …
## $ vehicle_direction_from           <chr> "South East", "South East", "South Ea…
## $ vehicle_direction_to             <chr> "North", "North", "North"
## $ vehicle_location_restricted_lane <chr> "On main c'way - not in restricted la…
## $ junction_location                <chr> "Not at or within 20 metres of juncti…
## $ skidding_and_overturning         <chr> "Skidded", "Skidded and overturned", …
## $ hit_object_in_carriageway        <chr> "None", "Previous accident", "Previou…
## $ vehicle_leaving_carriageway      <chr> "Did not leave carriageway", "Did not…
## $ hit_object_off_carriageway       <chr> "None", "None", "None"
## $ first_point_of_impact            <chr> "Nearside", "Front", "Front"
## $ vehicle_left_hand_drive          <chr> "No", "No", "No"
## $ journey_purpose_of_driver        <chr> "Not known", "Journey as part of work…
## $ sex_of_driver                    <chr> "Male", "Male", "Male"
## $ propulsion_code                  <chr> "Petrol", "Heavy oil", "Heavy oil"
## $ generic_make_model               <chr> "Data missing or out of range", NA, "…
## $ driver_home_area_type            <chr> "Urban area", "Urban area", "Small to…
## $ escooter_flag                    <chr> "Vehicle was not an e-scooter", "Vehi…
knitr::kable(wide_vehicles)
vehicle_reference accident_reference accident_year vehicle_type towing_and_articulation vehicle_manoeuvre vehicle_direction_from vehicle_direction_to vehicle_location_restricted_lane junction_location skidding_and_overturning hit_object_in_carriageway vehicle_leaving_carriageway hit_object_off_carriageway first_point_of_impact vehicle_left_hand_drive journey_purpose_of_driver sex_of_driver propulsion_code generic_make_model driver_home_area_type escooter_flag
1 520300610 2023 Motorcycle over 125cc and up to 500cc No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded None Did not leave carriageway None Nearside No Not known Male Petrol Data missing or out of range Urban area Vehicle was not an e-scooter
2 520300610 2023 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded and overturned Previous accident Did not leave carriageway None Front No Journey as part of work Male Heavy oil NA Urban area Vehicle was not an e-scooter
3 520300610 2023 Bus or coach (17 or more pass seats) No tow/articulation Going ahead left-hand bend South East North On main c’way - not in restricted lane Not at or within 20 metres of junction Skidded Previous accident Did not leave carriageway None Front No Journey as part of work Male Heavy oil Data missing or out of range Small town Vehicle was not an e-scooter

So, from these tables, you have much more information about what happened.

If you liked this exercise, you can challenge yourself to go back to the original casualties data and find the worst accident in terms of fatal casualties and replicate all this additional detective work.