This is a substitute for a live lecture. It draws on the British road safety data from 2024-11-15_22.
Read this file in its rendered (html) format and try to think yourself how you would proceed. You can sketch it by hand on paper, if you don’t trust yourself to code it all. The most important thing here is to understand why and how to manipulate the tables.
The British government published statistics of road accidents. We investigated a table where each row represented a casualty; that is, a person who died or was severely injured in an accident. Each accident had a unique ID. Each person only had a unique ID within that given accident. The table contained many other columns with details about the person (driver, passenger or pedestrian, age, sex, etc.), as well as what type of vehicle the person was on.
We extracted the worst accident by grouping the table by accident IDs and counting the rows (ie. casualties). The worst accident resulted in 70 casualties and happened in 2023. It looked like a crash of a bus/coach with a smaller vehicle.
In this session, we are going to join the casualties data with another table, which lists the accidents. It contains details about the vehicle types and some circumstances recorded by the police (e.g. weather, location, road type).
You possibly remember that we had to decode the values of categorical
variables such as sex (1
represented male
,
2
represented female
, 9
represented person did not want to tell
, and
-1
represented impossible to retrieve
or
similar). You will get the casualties data already decoded, but you will
do it with the accident data.
This exercise is a super-practical demonstration of
dplyr::..._join
and tidyr::pivot_...
in one
workflow! Enjoy!
library(readr, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(tidyr, warn.conflicts = FALSE, quietly = TRUE)
This is what we found out previously:
worst_accident <- read_tsv("all_casualties_labeled.tsv",
show_col_types = FALSE) %>%
group_by(accident_reference) %>%
add_count() %>%
ungroup() %>%
slice_max(order_by = n) %>%
select(-n)
knitr::kable(worst_accident)
ID | rowid | accident_reference | accident_year | age_of_casualty | vehicle_reference | casualty_class | casualty_type | sex_of_casualty | casualty_severity |
---|---|---|---|---|---|---|---|---|---|
rc_2023 | 647399 | 520300610 | 2023 | 39 | 1 | Driver or rider | Motorcycle over 125cc and up to 500cc rider or passenger | Male | Slight |
rc_2023 | 647400 | 520300610 | 2023 | 58 | 2 | Driver or rider | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647401 | 520300610 | 2023 | 55 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647402 | 520300610 | 2023 | 45 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647403 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647404 | 520300610 | 2023 | 28 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647405 | 520300610 | 2023 | 40 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647406 | 520300610 | 2023 | 22 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647407 | 520300610 | 2023 | 42 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647408 | 520300610 | 2023 | 25 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647409 | 520300610 | 2023 | 39 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647410 | 520300610 | 2023 | 45 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647411 | 520300610 | 2023 | 57 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647412 | 520300610 | 2023 | 46 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647413 | 520300610 | 2023 | 44 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647414 | 520300610 | 2023 | 43 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647415 | 520300610 | 2023 | 34 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647416 | 520300610 | 2023 | 29 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647417 | 520300610 | 2023 | 28 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647418 | 520300610 | 2023 | 41 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647419 | 520300610 | 2023 | 63 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647420 | 520300610 | 2023 | 54 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647421 | 520300610 | 2023 | 39 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647422 | 520300610 | 2023 | 42 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647423 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647424 | 520300610 | 2023 | 64 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647425 | 520300610 | 2023 | 27 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647426 | 520300610 | 2023 | 38 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647427 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647428 | 520300610 | 2023 | 52 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647429 | 520300610 | 2023 | 25 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647430 | 520300610 | 2023 | 27 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Female | Slight |
rc_2023 | 647431 | 520300610 | 2023 | 54 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647432 | 520300610 | 2023 | 30 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647433 | 520300610 | 2023 | 40 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Female | Serious |
rc_2023 | 647434 | 520300610 | 2023 | 45 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647435 | 520300610 | 2023 | 42 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647436 | 520300610 | 2023 | 50 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647437 | 520300610 | 2023 | 35 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647438 | 520300610 | 2023 | 35 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647439 | 520300610 | 2023 | 30 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647440 | 520300610 | 2023 | 41 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647441 | 520300610 | 2023 | 50 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647442 | 520300610 | 2023 | 27 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647443 | 520300610 | 2023 | 43 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647444 | 520300610 | 2023 | 34 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647445 | 520300610 | 2023 | 39 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647446 | 520300610 | 2023 | 41 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647447 | 520300610 | 2023 | 30 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647448 | 520300610 | 2023 | 25 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Female | Serious |
rc_2023 | 647449 | 520300610 | 2023 | 39 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647450 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647451 | 520300610 | 2023 | 34 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647452 | 520300610 | 2023 | 36 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647453 | 520300610 | 2023 | 52 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647454 | 520300610 | 2023 | 55 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647455 | 520300610 | 2023 | 61 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647456 | 520300610 | 2023 | 23 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647457 | 520300610 | 2023 | 35 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647458 | 520300610 | 2023 | 36 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647459 | 520300610 | 2023 | 49 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647460 | 520300610 | 2023 | 25 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647461 | 520300610 | 2023 | 61 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647462 | 520300610 | 2023 | 27 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Serious |
rc_2023 | 647463 | 520300610 | 2023 | 24 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647464 | 520300610 | 2023 | 34 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647465 | 520300610 | 2023 | 20 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647466 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647467 | 520300610 | 2023 | 53 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
rc_2023 | 647468 | 520300610 | 2023 | 31 | 2 | Passenger | Bus or coach occupant (17 or more pass seats) | Male | Slight |
Try and make sense of the data by grouping it by
vehicle_reference
, casualty_class
,
casualty_type
, casualty_severity
, and
accident_year
. Of course they are all going to have the
same accident_year
, but you will not lose the information
which year it is.
worst_accident %>%
group_by(vehicle_reference,
casualty_class,
casualty_type,
casualty_severity,
accident_year) %>%
count() %>%
knitr::kable()
vehicle_reference | casualty_class | casualty_type | casualty_severity | accident_year | n |
---|---|---|---|---|---|
1 | Driver or rider | Motorcycle over 125cc and up to 500cc rider or passenger | Slight | 2023 | 1 |
2 | Driver or rider | Bus or coach occupant (17 or more pass seats) | Slight | 2023 | 1 |
2 | Passenger | Bus or coach occupant (17 or more pass seats) | Serious | 2023 | 18 |
2 | Passenger | Bus or coach occupant (17 or more pass seats) | Slight | 2023 | 50 |
It looks as if a motorbike crashed with a bus/coach. Fortunately, nobody died in this accident.
Collisions
tableThe file all_about_the_accident.csv
contains the
relevant data from the 2023 Collision, so read it in and use the new
information to describe the accident in more detail.
all_about <- readr::read_csv("all_about_the_accident.csv", show_col_types = FALSE)
knitr::kable( all_about ) # knitr::kable just renders the table prettier
rowid | vehicle_reference | vehicle_type | towing_and_articulation | vehicle_manoeuvre | vehicle_direction_from | vehicle_direction_to | vehicle_location_restricted_lane | junction_location | skidding_and_overturning | hit_object_in_carriageway | vehicle_leaving_carriageway | hit_object_off_carriageway | first_point_of_impact | vehicle_left_hand_drive | journey_purpose_of_driver | sex_of_driver | age_of_driver | age_band_of_driver | engine_capacity_cc | propulsion_code | age_of_vehicle | generic_make_model | driver_imd_decile | driver_home_area_type | lsoa_of_driver | escooter_flag | driver_distance_banding |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Motorcycle over 125cc and up to 500cc | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | None | Did not leave carriageway | None | Nearside | No | Not known | Male | 39 | 36 - 45 | 955 | Petrol | 21 | Data missing or out of range | More deprived 20-30% | Urban area | E01029291 | Vehicle was not an e-scooter | Collision occurred between 10.001 and 20km of drivers home postcode |
2 | 2 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded and overturned | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | 58 | 56 - 65 | 5132 | Heavy oil | 4 | WRIGHTBUS STREETDECK | More deprived 20-30% | Urban area | E01014807 | Vehicle was not an e-scooter | Collision occurred between 20.001 and 100km of drivers home postcode |
3 | 3 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | 54 | 46 - 55 | 10837 | Heavy oil | 5 | Data missing or out of range | Less deprived 30-40% | Small town | E01029267 | Vehicle was not an e-scooter | Collision occurred between 10.001 and 20km of drivers home postcode |
Clue in the code snippet below.
# We had to revise the number of vehicles involved, hadn't we?
# It was a motorbike and TWO busses. We couldn't know about the third one from the casualties data because Vehicle 3 contributed with no casualties. And we did not suspect there was a third like we would have had if there were casualties on the third vehicle but none on the second or first.
# What happened: the motorbike driver skidded (slid) on the surface when bending left on a c-road. It was not a junction. In the same direction followed a bus (Vehicle 2). They probably crashed so that the biker landed in front of the bus, slightly turned orthogonally, so that the bus bumped into its front side with its own front. Or the bike driver fell off his motorbike. The driver was probably doing his best to avoid overrunning the driver and probably hit just the motorbike (hit_object_in_carriageway - driver has none, this looks like falling off, no? The bus has Previous accident, that will be the motorbike ). Anyway, the bus also skidded and even overturned. Another bus (Vehicle 3) finally hit Vehicle 2 or Vehicle 1. All casualties except the biker were sitting on Vehicle 2, that is, the earlier bus that overturned.
# The motorbike was rather old, the buses rather new. None of the drivers was extremely young or extremely old. We don't know whether anyone was to blame.
Now, we could of course learn more about the casualties just by eyeballing, but how about adding the casualty severity of drivers to this table?
First, make a drivers
table by filtering just drivers
from the worst_accident
table and select the columns
casualty_class
, casualty_severity
,
accident_reference
, and vehicle_reference
.
drivers <- worst_accident %>%
filter(casualty_class == "Driver or rider") %>%
select(casualty_class, casualty_severity, accident_reference, vehicle_reference, )
drivers %>% knitr::kable()
casualty_class | casualty_severity | accident_reference | vehicle_reference |
---|---|---|---|
Driver or rider | Slight | 520300610 | 1 |
Driver or rider | Slight | 520300610 | 2 |
Join the drivers
and all_about
tables so
that it contains all vehicles. You can just print it out without saving
in a variable.
drivers %>% full_join(all_about, by = "vehicle_reference") %>%
knitr::kable()
casualty_class | casualty_severity | accident_reference | vehicle_reference | rowid | vehicle_type | towing_and_articulation | vehicle_manoeuvre | vehicle_direction_from | vehicle_direction_to | vehicle_location_restricted_lane | junction_location | skidding_and_overturning | hit_object_in_carriageway | vehicle_leaving_carriageway | hit_object_off_carriageway | first_point_of_impact | vehicle_left_hand_drive | journey_purpose_of_driver | sex_of_driver | age_of_driver | age_band_of_driver | engine_capacity_cc | propulsion_code | age_of_vehicle | generic_make_model | driver_imd_decile | driver_home_area_type | lsoa_of_driver | escooter_flag | driver_distance_banding |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Driver or rider | Slight | 520300610 | 1 | 1 | Motorcycle over 125cc and up to 500cc | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | None | Did not leave carriageway | None | Nearside | No | Not known | Male | 39 | 36 - 45 | 955 | Petrol | 21 | Data missing or out of range | More deprived 20-30% | Urban area | E01029291 | Vehicle was not an e-scooter | Collision occurred between 10.001 and 20km of drivers home postcode |
Driver or rider | Slight | 520300610 | 2 | 2 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded and overturned | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | 58 | 56 - 65 | 5132 | Heavy oil | 4 | WRIGHTBUS STREETDECK | More deprived 20-30% | Urban area | E01014807 | Vehicle was not an e-scooter | Collision occurred between 20.001 and 100km of drivers home postcode |
NA | NA | NA | 3 | 3 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | 54 | 46 - 55 | 10837 | Heavy oil | 5 | Data missing or out of range | Less deprived 30-40% | Small town | E01029267 | Vehicle was not an e-scooter | Collision occurred between 10.001 and 20km of drivers home postcode |
guide.xlsx
Now we will concentrate on how to obtain the convenient
all_about_the_accident.csv
from the data that the British
government has actually published.
To load the original data, read in collisions_2023.csv
and save it in a variable called collisions
.
collisions <- read_csv("collisions_2023.csv")
## Rows: 104258 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): accident_index, accident_reference, location_easting_osgr, locati...
## dbl (26): accident_year, police_force, accident_severity, number_of_vehicle...
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now glimpse it.
glimpse(collisions)
## Rows: 104,258
## Columns: 37
## $ accident_index <chr> "2023010419171", "20230104…
## $ accident_year <dbl> 2023, 2023, 2023, 2023, 20…
## $ accident_reference <chr> "010419171", "010419183", …
## $ location_easting_osgr <chr> "525060", "535463", "50870…
## $ location_northing_osgr <chr> "170416", "198745", "17769…
## $ longitude <chr> "-0.202878", "-0.042464", …
## $ latitude <chr> "51.418974", "51.671155", …
## $ police_force <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ accident_severity <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ number_of_vehicles <dbl> 1, 3, 2, 2, 2, 1, 2, 1, 1,…
## $ number_of_casualties <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ date <chr> "01/01/2023", "01/01/2023"…
## $ day_of_week <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ time <time> 01:24:00, 02:25:00, 03:50…
## $ local_authority_district <dbl> -1, -1, -1, -1, -1, -1, -1…
## $ local_authority_ons_district <chr> "E09000024", "E09000010", …
## $ local_authority_highway <chr> "E09000024", "E09000010", …
## $ first_road_class <dbl> 5, 6, 3, 3, 3, 3, 3, 3, 3,…
## $ first_road_number <dbl> 0, 0, 437, 5, 3220, 5, 3, …
## $ road_type <dbl> 2, 6, 1, 6, 6, 6, 6, 3, 6,…
## $ speed_limit <dbl> 20, 30, 30, 30, 30, 30, 20…
## $ junction_detail <dbl> 9, 3, 1, 3, 8, 0, 6, 0, 0,…
## $ junction_control <dbl> 4, 4, 4, 4, 4, -1, 2, -1, …
## $ second_road_class <dbl> 5, 6, 6, 6, 6, 0, 3, 0, 0,…
## $ second_road_number <dbl> 0, 0, 0, 0, 0, -1, 3202, -…
## $ pedestrian_crossing_human_control <dbl> 2, 0, 0, 0, 0, 2, 0, 0, 0,…
## $ pedestrian_crossing_physical_facilities <dbl> 5, 1, 0, 0, 0, 5, 0, 0, 4,…
## $ light_conditions <dbl> 4, 4, 4, 4, 4, 4, 4, 1, 4,…
## $ weather_conditions <dbl> 8, 1, 1, 9, 1, 1, 1, 1, 1,…
## $ road_surface_conditions <dbl> 2, 1, 1, 1, 1, 1, 2, 1, 1,…
## $ special_conditions_at_site <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ carriageway_hazards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ urban_or_rural_area <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ did_police_officer_attend_scene_of_accident <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ trunk_road_flag <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ lsoa_of_accident_location <chr> "E01003383", "E01001547", …
## $ enhanced_severity_collision <dbl> -1, -1, -1, -1, -1, -1, -1…
Values of obviously categorical variables are encoded with digits
just like we saw it in the original casualties
tables. Use
the readxl
library to read in the first worksheet of the
excel file guide.xlsx
and call the variable
guide
. And glimpse it. Or better, just open the xlsx file
in Excel and look at it with all convenience.
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.2
guide <- read_xlsx("guide.xlsx")
glimpse(guide)
## Rows: 1,785
## Columns: 5
## $ table <chr> "accident", "accident", "accident", "accident", "acciden…
## $ `field name` <chr> "collision_index", "collision_year", "collision_referenc…
## $ `code/format` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "1", "3", "4", "…
## $ label <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Metropolitan Po…
## $ note <chr> "unique value for each accident. The accident_index comb…
Obviously, the values in the column field name
at least
partly correspond to names of columns in the collisions
data frame. The values in the column code/format
correspond
to values actually spotted in the collisions
data
frame.
If we transformed the collisions
data frame to a long
format with a column of column names and a column of
code/format
values of these column names, and if we kept
the label
column, we could join collisions
with guide
on the columns field name
and
code/format
!!
Now, which columns should occur in the field name
column? If we pick all, the long format could be really long (the
current number of rows times the number of columns times the number of
their observed values!), so it makes some sense to make a sensible
selection. Below are the column names again, and you are welcome to read
the note column in the Excel spreadsheet (in Excel) to make up your
mind.
To be super sure that we don’t get very long data, we extract just
the rows of collisions that represent the accident we focus on and save
this selection in a variable called collisions_worst
. First
save the value of accident_reference
from
worst_accident
as one number in a variable called
accident_index_worst
and then use this variable for
filtering.
accident_index_worst <- worst_accident %>%
distinct(accident_reference) %>%
pull()
collisions_worst <- collisions %>%
filter(accident_reference == accident_index_worst)
Now glimpse collisions_worst
glimpse(collisions_worst)
## Rows: 1
## Columns: 37
## $ accident_index <chr> "2023520300610"
## $ accident_year <dbl> 2023
## $ accident_reference <chr> "520300610"
## $ location_easting_osgr <chr> "326970"
## $ location_northing_osgr <chr> "137916"
## $ longitude <chr> "-3.045176"
## $ latitude <chr> "51.135841"
## $ police_force <dbl> 52
## $ accident_severity <dbl> 2
## $ number_of_vehicles <dbl> 3
## $ number_of_casualties <dbl> 70
## $ date <chr> "17/01/2023"
## $ day_of_week <dbl> 3
## $ time <time> 06:00:00
## $ local_authority_district <dbl> -1
## $ local_authority_ons_district <chr> "E06000066"
## $ local_authority_highway <chr> "E06000066"
## $ first_road_class <dbl> 3
## $ first_road_number <dbl> 39
## $ road_type <dbl> 6
## $ speed_limit <dbl> 40
## $ junction_detail <dbl> 0
## $ junction_control <dbl> -1
## $ second_road_class <dbl> 0
## $ second_road_number <dbl> -1
## $ pedestrian_crossing_human_control <dbl> 0
## $ pedestrian_crossing_physical_facilities <dbl> 0
## $ light_conditions <dbl> 6
## $ weather_conditions <dbl> 1
## $ road_surface_conditions <dbl> 4
## $ special_conditions_at_site <dbl> 0
## $ carriageway_hazards <dbl> 3
## $ urban_or_rural_area <dbl> 2
## $ did_police_officer_attend_scene_of_accident <dbl> 1
## $ trunk_road_flag <dbl> 2
## $ lsoa_of_accident_location <chr> "E01032632"
## $ enhanced_severity_collision <dbl> -1
A reasonable choice is stored here in the pickthese
variable as a vector of column indices.
pickthese <- c(8, 9, 13, 15, 16, 17, 18, 20,22, 23, 24,26:34, 37)
So, let’s create the long
table from
collisions_worst
! If you do this on your own first, you
will probably get this error:
Error in `pivot_longer()`:
! Can't combine `police_force` <double> and `local_authority_ons_district` <character>.
This means that this function cannot coerce all the values to
character itself but needs you to do do that beforehand. Therefore, you
must convert all these columns to character vectors before calling
pivot_longer
.
long <- collisions_worst %>%
mutate(across(all_of(pickthese), ~ as.character(.x))) %>%
pivot_longer(
cols = all_of(pickthese), # use all_of() if you store the selection in a variable you have defined before. It would work without but you will get a warning that it is deprecated now.
names_to = "field name",
values_to = "code/format"
)
Now you can join the long
data frame with the
guide
data frame, but not to clutter the table with useless
columns from guide
, let us first select from it only
field_name
, code/format
, and
label
(and save this under the same variable name).
guide <- guide %>% select(c("field name", "code/format", "label"))
Let’s call the joined data frame joined
.
joined <- long %>% left_join(guide, by = c("field name", "code/format"))
We do not need the code/format
column any more, so we
unselect it. Now we would like to transform the field name
column into column names put the values from the corresponding
label
column into their values. The whole data frame should
then have again only one row and some more columns.
You will need the pivot_wider
function. If you define
the two main arguments, namely names_from
and
values_from
, you will not be entirely happy, because you
will get a few more rows than the expected one, and their values will
contain many NA
. You need to tell the function that the
combination of columns that were not part of the long transformation
belong all to one observation. In other words, you need an ID column.
And you already have it: the accident_reference
, or even
year
- any column will do that contains the same value
throughout.
This would also have worked if you had worked with the entire
collisions
data. You would have obtained the original
number of rows as well. This looks scary at first, but the explanation
and solution are simple!
Let’s call this new data frame wide
.
wide <- joined %>%
pivot_wider(names_from = "field name",
values_from = "label",
id_cols = "accident_index"
)
To drill down to the individual vehicles, we actually also need to
look at the vehicles_2023.csv
file.
First create a data frame called vehicles_worst
that
will only contain vehicles involved in the worst accident (use the value
stored in the accident_index_worst
variable)
vehicles_worst <- read_csv("vehicles_2023.csv", show_col_types = FALSE) %>%
filter(accident_reference == accident_index_worst)
vehicles_worst %>% glimpse()
## Rows: 3
## Columns: 34
## $ accident_index <chr> "2023520300610", "2023520300610", "20…
## $ accident_year <dbl> 2023, 2023, 2023
## $ accident_reference <chr> "520300610", "520300610", "520300610"
## $ vehicle_reference <dbl> 1, 2, 3
## $ vehicle_type <dbl> 4, 11, 11
## $ towing_and_articulation <dbl> 0, 0, 0
## $ vehicle_manoeuvre <dbl> 16, 16, 16
## $ vehicle_direction_from <dbl> 4, 4, 4
## $ vehicle_direction_to <dbl> 1, 1, 1
## $ vehicle_location_restricted_lane <dbl> 0, 0, 0
## $ junction_location <dbl> 0, 0, 0
## $ skidding_and_overturning <dbl> 1, 2, 1
## $ hit_object_in_carriageway <dbl> 0, 1, 1
## $ vehicle_leaving_carriageway <dbl> 0, 0, 0
## $ hit_object_off_carriageway <dbl> 0, 0, 0
## $ first_point_of_impact <dbl> 4, 1, 1
## $ vehicle_left_hand_drive <dbl> 1, 1, 1
## $ journey_purpose_of_driver <dbl> 6, 1, 1
## $ sex_of_driver <dbl> 1, 1, 1
## $ age_of_driver <dbl> 39, 58, 54
## $ age_band_of_driver <dbl> 7, 9, 8
## $ engine_capacity_cc <dbl> 955, 5132, 10837
## $ propulsion_code <dbl> 1, 2, 2
## $ age_of_vehicle <dbl> 21, 4, 5
## $ generic_make_model <chr> "-1", "WRIGHTBUS STREETDECK", "-1"
## $ driver_imd_decile <dbl> 3, 3, 7
## $ driver_home_area_type <dbl> 1, 1, 2
## $ lsoa_of_driver <chr> "E01029291", "E01014807", "E01029267"
## $ escooter_flag <dbl> 0, 0, 0
## $ dir_from_e <chr> "NULL", "NULL", "NULL"
## $ dir_from_n <chr> "NULL", "NULL", "NULL"
## $ dir_to_e <chr> "NULL", "NULL", "NULL"
## $ dir_to_n <chr> "NULL", "NULL", "NULL"
## $ driver_distance_banding <dbl> 3, 4, 3
Decode the values of closed-listed categorical variables again. First
pick columns to pour together with pivot_longer
. Call the
selection pickthese2
. These is the vector of all column
names:
colnames(vehicles_worst)
## [1] "accident_index" "accident_year"
## [3] "accident_reference" "vehicle_reference"
## [5] "vehicle_type" "towing_and_articulation"
## [7] "vehicle_manoeuvre" "vehicle_direction_from"
## [9] "vehicle_direction_to" "vehicle_location_restricted_lane"
## [11] "junction_location" "skidding_and_overturning"
## [13] "hit_object_in_carriageway" "vehicle_leaving_carriageway"
## [15] "hit_object_off_carriageway" "first_point_of_impact"
## [17] "vehicle_left_hand_drive" "journey_purpose_of_driver"
## [19] "sex_of_driver" "age_of_driver"
## [21] "age_band_of_driver" "engine_capacity_cc"
## [23] "propulsion_code" "age_of_vehicle"
## [25] "generic_make_model" "driver_imd_decile"
## [27] "driver_home_area_type" "lsoa_of_driver"
## [29] "escooter_flag" "dir_from_e"
## [31] "dir_from_n" "dir_to_e"
## [33] "dir_to_n" "driver_distance_banding"
pickthese2 <- c(5,6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 23, 25, 27,29)
Create the vehicles_long
data frame. Don’t forget to
mutate all the pivoted columns to character vectors and use the helper
function all_of
with the cols
argument to
prevent the warning.
vehicles_long <- vehicles_worst %>%
mutate(across(all_of(pickthese2), ~ as.character(.x))) %>%
pivot_longer(
cols = all_of(pickthese2),
names_to = "field name",
values_to = "code/format"
)
Join the vehicles_long
with the guide
again
and finally deselect the code/format
column. Call the
result joined_vehicles
.
joined_vehicles <- vehicles_long %>%
left_join(guide, by = c("field name", "code/format")) %>%
select(!c("code/format"))
Finally you want to make the table wide again. Note that this time
you must tell tidyr that you want a row for each vehicle, so
use the vehicle_reference
, accident_reference
,
and accident_year
as a value of the id_cols
argument.
Technically, you would only need the vehicle reference to obtain one
row per vehicle, but the function simply trashes columns that are
neither mentioned in this argument nor among the names_from
and values_from
vectors! So you would lose the accident
reference and accident year. Call the result
wide_vehicles
.
wide_vehicles <- joined_vehicles %>%
pivot_wider(id_cols = c("vehicle_reference",
"accident_reference",
"accident_year"),
names_from = "field name",
values_from = "label"
)
glimpse(wide_vehicles)
## Rows: 3
## Columns: 22
## $ vehicle_reference <dbl> 1, 2, 3
## $ accident_reference <chr> "520300610", "520300610", "520300610"
## $ accident_year <dbl> 2023, 2023, 2023
## $ vehicle_type <chr> "Motorcycle over 125cc and up to 500c…
## $ towing_and_articulation <chr> "No tow/articulation", "No tow/articu…
## $ vehicle_manoeuvre <chr> "Going ahead left-hand bend", "Going …
## $ vehicle_direction_from <chr> "South East", "South East", "South Ea…
## $ vehicle_direction_to <chr> "North", "North", "North"
## $ vehicle_location_restricted_lane <chr> "On main c'way - not in restricted la…
## $ junction_location <chr> "Not at or within 20 metres of juncti…
## $ skidding_and_overturning <chr> "Skidded", "Skidded and overturned", …
## $ hit_object_in_carriageway <chr> "None", "Previous accident", "Previou…
## $ vehicle_leaving_carriageway <chr> "Did not leave carriageway", "Did not…
## $ hit_object_off_carriageway <chr> "None", "None", "None"
## $ first_point_of_impact <chr> "Nearside", "Front", "Front"
## $ vehicle_left_hand_drive <chr> "No", "No", "No"
## $ journey_purpose_of_driver <chr> "Not known", "Journey as part of work…
## $ sex_of_driver <chr> "Male", "Male", "Male"
## $ propulsion_code <chr> "Petrol", "Heavy oil", "Heavy oil"
## $ generic_make_model <chr> "Data missing or out of range", NA, "…
## $ driver_home_area_type <chr> "Urban area", "Urban area", "Small to…
## $ escooter_flag <chr> "Vehicle was not an e-scooter", "Vehi…
knitr::kable(wide_vehicles)
vehicle_reference | accident_reference | accident_year | vehicle_type | towing_and_articulation | vehicle_manoeuvre | vehicle_direction_from | vehicle_direction_to | vehicle_location_restricted_lane | junction_location | skidding_and_overturning | hit_object_in_carriageway | vehicle_leaving_carriageway | hit_object_off_carriageway | first_point_of_impact | vehicle_left_hand_drive | journey_purpose_of_driver | sex_of_driver | propulsion_code | generic_make_model | driver_home_area_type | escooter_flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 520300610 | 2023 | Motorcycle over 125cc and up to 500cc | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | None | Did not leave carriageway | None | Nearside | No | Not known | Male | Petrol | Data missing or out of range | Urban area | Vehicle was not an e-scooter |
2 | 520300610 | 2023 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded and overturned | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | Heavy oil | NA | Urban area | Vehicle was not an e-scooter |
3 | 520300610 | 2023 | Bus or coach (17 or more pass seats) | No tow/articulation | Going ahead left-hand bend | South East | North | On main c’way - not in restricted lane | Not at or within 20 metres of junction | Skidded | Previous accident | Did not leave carriageway | None | Front | No | Journey as part of work | Male | Heavy oil | Data missing or out of range | Small town | Vehicle was not an e-scooter |
So, from these tables, you have much more information about what happened.
If you liked this exercise, you can challenge yourself to go back to the original casualties data and find the worst accident in terms of fatal casualties and replicate all this additional detective work.