This notebook analyzes a publicly available dataset on Google Cloud, bigquery-public-data.san_francisco.bikeshare_trips
, from Bay Wheels. Part 1 focuses on understanding commuter trips (what Bay Wheels is most commonly used for). Part 2 (the more extensive part of the notebook) focuses on composing and addressing questions that showcase opportunities for business growth. In total I make four recommendations, articulated below.
import matplotlib.pyplot as plt
A commute is defined as "a regular journey of some distance to and from your place of work", and a commuter can be thought of as an individual participating in a commute. With respect to Bay Wheels trips as it relates to the previously stated definition, I require that a "commuter trip" consists of the following properties:
1) The type of subscriber must be a "Subscriber", not a "Customer".
2) The time of the trip must be during a regular morning or evening time (heading to and from work, respectively. The exact times of commutes is determined below).
3) Be at least 60 seconds long (to rule out bad data and to ensure that the trip was of a sufficient distance).
4) The day of the week must be a weekday (further verified below).
Using intuition that subscribers are likely going to be commuters, we can determine the most common days and times that subscribers use Bay Wheels.
%%bigquery commuter_days
SELECT count(trip_id) AS ts, (EXTRACT (DAYOFWEEK FROM start_date) ) AS tday
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber' AND duration_sec > 60
GROUP BY tday
ORDER BY ts DESC
commuter_days
It is apparent that "Subscribers" are riding on weekdays rather than weekends. Now let's examine the hours in which they are riding.
%%bigquery commuter_hours
SELECT count(trip_id) AS ts, (EXTRACT (hour FROM start_date) ) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber' AND duration_sec > 60
GROUP BY thour
ORDER BY ts DESC
commuter_hours.head(6)
Note the two main groups of times
plt.bar(commuter_hours['thour'], commuter_hours['ts'])
plt.title('Number of Trips vs Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel("Number of Trips")
These findings back intuition that the most common days of the week and times of day for commutes are weekdays during the hours from 7 AM (inclusive) - 10 AM (exclusive) and 4 PM (inclusive) to 7 PM (exclusive).
The query below utilizes all four of the previously mentioned criteria, including the exact times of day that are most popular for subscribers to ride during. The five most common start and end stations as well as the number of trips that start and end at each one is output in the dataframe below.
%%bigquery commuter_trips_df
SELECT commute.start_station_id, commute.start_station_name, commute.end_station_id, commute.end_station_name, count(*) as trip_count FROM
(SELECT start_station_id, start_station_name, end_station_id, end_station_name, (EXTRACT (DAYOFWEEK FROM start_date)) AS day_of_week, (EXTRACT (hour FROM start_date)) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber' AND duration_sec > 60) commute
WHERE (day_of_week BETWEEN 2 AND 6) AND ((thour BETWEEN 7 AND 9) OR (thour BETWEEN 16 AND 18))
GROUP BY commute.start_station_name, commute.end_station_name, commute.start_station_id, commute.end_station_id
ORDER BY trip_count DESC LIMIT 5
commuter_trips_df
There are two categories of recommendations for offers (just below) and a total of four offers (described in detail throughout the remained of this notebook) I suggest based on the data analysis I have completed to this point.
Below, I dive into specific recommendations within each category to increase revenue.
Below we will see that subscribers and customers use Bay Wheels differently. With this in mind, I recommend that we draw a clear distinction in the types of rides available to subscribers and customers, as well as promote an overlap that will encourage subscribers to view the platform for more than just commutes and standard customers to view the platform as more than just a way to enjoy a periodically long bike ride and instead use it for their everyday commute as well.
First, let us examine the length of time rides occur for subscribers vs customers
%%bigquery length_of_ride_df
SELECT subscriber_type, ROUND(AVG(duration_sec)/60, 2) AS minutes
FROM
`bigquery-public-data.san_francisco.bikeshare_trips`
GROUP BY subscriber_type
length_of_ride_df
Additionally, let's examine the number of customers vs subscribers we have using the platform.
%%bigquery subscription_counts_df
SELECT subscriber_type, COUNT(subscriber_type) total
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
GROUP BY subscriber_type;
subscription_counts_df
Last, let's identify the days of the week that the subscription type of "customer" uses the platform. We can compare this to previous work, which identified the days of the week that "subscribers" use the platform.
%%bigquery customer_days_df
SELECT count(trip_id) AS ts, (EXTRACT (DAYOFWEEK FROM start_date) ) AS tday
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Customer' AND duration_sec > 60
GROUP BY tday
ORDER BY ts DESC
customer_days_df
As seen in part 1 and backed by the clear examples above that subscribers vs customers use the platform differently, I recommend Bay Wheels draw a clear distinction in membership for what the subscription service offers to its users. Specifically, I recommend that the subscription only work Monday - Friday. The individuals who are subscribers will still have their current needs met, which includes a morning and evening commute, but will allow for further income when they take weekend rides of a significant distance, as the customers do. At the moment, this does not seem to occur very often, as the least popular days of the week for rides for subscribers are Saturday and Sunday. However, for customers, these are the most popular days of the week for rides.
I recommend that Bay Wheels promote weekend rides via advertising on the platform to their subscribers and provide them with a small discount on these rides as well to promote the idea that Bay Wheels is not only a platform for commuting, but is also one that can provide a long, enjoyable weekend trip for an hour, as the customer accounts currently use the platform for. This may also positively alter subscriber opinions of Bay Wheels, as they will not only associate it with work, but also with an enjoyable weekend activity. This change in mindset will increase revenue and customer longevity.
The "customer" base for Bay Wheels is significantly smaller than their "subscriber" base. However, it still makes up approximately 14% of the total users of Bay Wheels, and therefore is a significant source of revenue. This could be further optimized by promoting discounts to these customers for their first month of a subscription. In converting them to a subscriber, this would allow them to likely continue to use the platform on the weekends (the most common days that the "customer" type uses the platform), as well as provide them with a new view that Bay Wheels can serve them almost daily in their work commute. This should increase their reliability on Bay Wheels and lead to an increase in revenue for the company.
We need to ensure that the areas with the highest demand have a supply (bikes available) that meets our user's needs. Specifically I begin by examining stations with peak commuter trips to ensure that during the morning and evening commute bikes are available for riders.
Below, we start off with examining the most popular starting locations for morning commutes (where we need to ensure availability).
%%bigquery morning_commuter_trips_df
SELECT commute.start_station_id, commute.start_station_name, count(*) as trip_count FROM
(SELECT start_station_id, start_station_name, (EXTRACT (DAYOFWEEK FROM start_date)) AS day_of_week, (EXTRACT (hour FROM start_date)) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber' AND duration_sec > 60) commute
WHERE (day_of_week BETWEEN 2 AND 6) AND (thour BETWEEN 7 AND 9) -- morning trips on weekdays
GROUP BY commute.start_station_name, commute.start_station_id
ORDER BY trip_count DESC LIMIT 5
morning_commuter_trips_df
Next, I utilize the above SQL query in a fairly complex series of subqueries to identify the availability of bikes at these popular stations throughout the day to determine if there is ample availability at these stations when they are needed.
%%bigquery bike_availability_morning_df
-- The outer query specifies a focus on the hour and availability of bikes
SELECT popular.thour,
AVG(popular.avg_bikes_available) AS num_available,
AVG(popular.percent_bikes_available) AS percent_available
FROM ( -- This query focuses on obtaining bike availability and dockcount at a given station to determine percent utilized
SELECT ba.thour,
AVG(ba.avg_available) AS avg_bikes_available,
(
AVG(ba.avg_available) / MAX(stat.dockcount) * 100
) AS percent_bikes_available
FROM (
SELECT station_id,
AVG(bikes_available) AS avg_available,
(
EXTRACT (
HOUR
FROM time
)
) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_status`
WHERE (EXTRACT (DAYOFWEEK FROM time)) BETWEEN 2 AND 6 -- get bikeshare availability for stations just on weekdays
GROUP BY station_id,
thour
) ba
-- use bikeshare station information for dockcount statistic
JOIN `bigquery-public-data.san_francisco.bikeshare_stations` stat ON stat.station_id = ba.station_id
GROUP BY ba.thour,
ba.station_id
HAVING ba.station_id IN ( -- Only use the stations that are of relevance (popular for commuting in the morning). This is the query above.
SELECT commute.start_station_id
FROM (
SELECT start_station_id,
start_station_name,
(
EXTRACT (
DAYOFWEEK
FROM start_date
)
) AS day_of_week,
(
EXTRACT (
hour
FROM start_date
)
) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber'
AND duration_sec > 60
) commute
WHERE ( -- weekdays
day_of_week BETWEEN 2 AND 6
)
AND ( -- mornings
thour BETWEEN 7 AND 9
)
GROUP BY commute.start_station_name,
commute.start_station_id
ORDER BY COUNT(*) DESC
LIMIT 5
)
ORDER BY avg_bikes_available DESC
) popular
GROUP BY popular.thour
ORDER BY popular.thour ASC
Again, for clarity, this table illustrates the number of bikes available and percent of docks filled at the stations that have the most morning commutes.
bike_availability_morning_df
plt.bar(bike_availability_morning_df['thour'], bike_availability_morning_df['num_available'])
plt.title('Bike Availability at Popular Morning Stations for Subscribers')
plt.xlabel('Hour of Day')
plt.ylabel("Bikes Available")
We can see that there is a significant dip in the availability at these stations during the day, but at the end of the day, they return back to their previous state (likely users on their return commute) and the stations average approximately 2/3 full overnight and have plenty of bikes available at all times for their riders (an average of at least 6, even after the bikes are taken away by morning commuters).
Let's also examine the popular evening starting stations of subscribers for the station's bike availability throughout the day to see if any adjustments need to be made to supply these very popular stations.
%%bigquery evening_commuter_trips_df
SELECT commute.start_station_id, commute.start_station_name, count(*) as trip_count FROM
(SELECT start_station_id, start_station_name, (EXTRACT (DAYOFWEEK FROM start_date)) AS day_of_week, (EXTRACT (hour FROM start_date)) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber' AND duration_sec > 60) commute
WHERE (day_of_week BETWEEN 2 AND 6) AND (thour BETWEEN 16 AND 18) -- evening trips on weekdays
GROUP BY commute.start_station_name, commute.start_station_id
ORDER BY trip_count DESC LIMIT 5
evening_commuter_trips_df
%%bigquery bike_availability_evening_df
-- The outer query specifies a focus on the hour and availability of bikes
SELECT popular.thour,
AVG(popular.avg_bikes_available) AS num_available,
AVG(popular.percent_bikes_available) AS percent_available
FROM ( -- This query focuses on obtaining bike availability and dockcount at a given station to determine percent utilized
SELECT ba.thour,
AVG(ba.avg_available) AS avg_bikes_available,
(
AVG(ba.avg_available) / MAX(stat.dockcount) * 100
) AS percent_bikes_available
FROM ( -- get each stations availability by hour
SELECT station_id,
AVG(bikes_available) AS avg_available,
(
EXTRACT (
HOUR
FROM time
)
) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_status`
WHERE (EXTRACT (DAYOFWEEK FROM time)) BETWEEN 2 AND 6 -- get bikeshare availability for stations just on weekdays
GROUP BY station_id,
thour
) ba
-- use bikeshare station information for dockcount statistic
JOIN `bigquery-public-data.san_francisco.bikeshare_stations` stat ON stat.station_id = ba.station_id
GROUP BY ba.thour,
ba.station_id
HAVING ba.station_id IN ( -- Specify an interest in just the stations that are of relevance (popular for commuting in the evening)
SELECT commute.start_station_id
FROM (
SELECT start_station_id,
start_station_name,
(
EXTRACT (
DAYOFWEEK
FROM start_date
)
) AS day_of_week,
(
EXTRACT (
hour
FROM start_date
)
) AS thour
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
WHERE subscriber_type = 'Subscriber'
AND duration_sec > 60
) commute
WHERE ( -- weekdays
day_of_week BETWEEN 2 AND 6
)
AND ( -- evenings
thour BETWEEN 16 AND 18
)
GROUP BY commute.start_station_name,
commute.start_station_id
ORDER BY COUNT(*) DESC
LIMIT 5
)
ORDER BY avg_bikes_available DESC
) popular
GROUP BY popular.thour
ORDER BY thour ASC
This table illustrates the number of bikes available and percent of docks filled at the stations that have the most morning commutes.
bike_availability_evening_df
plt.bar(bike_availability_evening_df['thour'], bike_availability_evening_df['num_available'])
plt.title('Bike Availability at Popular Evening Stations for Subscribers')
plt.xlabel('Hour of Day')
plt.ylabel("Bikes Available")
There is surprisingly less of a dip in the availability at these stations in the evening when they are popular to start from. This must be caused by a flux of bikes also appearing in the station at that time as well.
It is clear that for popular commuter stations, specifically those in the morning, taking out a bike is more popular at certain times of day than others. Bay Wheels could encourage potential subscribers to try a ride to work with a notification on the Lyft application and a discount code. This should help increase the subscriber conversion rate, and it is clear that even though bikes are less available at these stations due to the morning commute, Bay Wheels still has enough in the area to optimize this further. If it is determined that not enough bikes are available and demand is too high (ex. if it is determined that the reason several bikes are consistently available at these stations is because they are broken down / not usable) or the previous changes cause demand to be too high, prices could be increased on "customer" rides at these times to increase revenue and support subscribers. During non-peak subscriber hours where the bikes are largely stagnant at these locations, discount codes could be offered to potential customers in the area through the Lyft application to encourage ridership, such as during lunch hour.
Next, I examine stations that are popular and not popular to start a ride from. The query results indicate a vital problem that needs to be addressed: There are some stations that have been around for years, but still only have generated several hundred rides and have on average at least 5 bikes present. Meanwhile, others have generated tens of thousands of rides, and are therefore in much more popular areas and generate a greater amount of revenue.
%%bigquery identify_stagnant_stations
SELECT bs.station_id, si.total_trips, DATE_DIFF(DATE(si.max_end), DATE(si.min_start), DAY) AS days_of_operation, AVG(bikes_available) AS avg_available
FROM `bigquery-public-data.san_francisco.bikeshare_status` bs
JOIN (SELECT start_station_id, COUNT(*) AS total_trips, MIN(start_date) AS min_start, MAX(end_date) AS max_end FROM `bigquery-public-data.san_francisco.bikeshare_trips` GROUP BY start_station_id) si ON si.start_station_id = bs.station_id
GROUP BY bs.station_id, si.total_trips, si.min_start, si.max_end
ORDER BY total_trips ASC
NOTE "days_of_operation" is calculated from the first trip that started from a station to the latest trip starting from that station. Therefore it is an imperfect metric in what it is being referred to as, but is nonetheless still useful.
identify_stagnant_stations.head(10)
identify_stagnant_stations.tail(10)
plt.scatter(identify_stagnant_stations['days_of_operation'], identify_stagnant_stations['total_trips'])
plt.title("Station Days in Operation vs Number of Trips from Station")
plt.xlabel("Days in Operation")
plt.ylabel("Number of Trips")
It is clear that there are many stations that have bikes that are not frequently being used. Whenever bikes are not being used, they are not generating revenue for Bay Wheels. While holding back a couple of bikes in less popular areas makes sense to ensure availability at all stations, whenever there is a surplus of bikes (at these less popular stations, I believe more than 4 would count as a surplus considering they are rarely being used), we could notify subscribers and potential customers in the area at a regular interval (once per week) that there is a bike in the area that is discounted for them (ex. 20%) to ride should they drop it off at a new station when they are done.
The four recommendations are described above. All of them come from analyzing a fairly old dataset (August 29, 2013 to Agust 31, 2016), which will now include outdated information. Furthermore, the data itself is not clean. For example, it was identified during exploratory data analysis that station 87 was a station that had poor data with very little recorded data. Many of the other, newer stations such as station 91 also contained very few trips, and therefore should be examined differently and with additional caution compared to the stations with greater longevity. Furthermore, some of the data was not clean, such as station 22 having multiple names with just a slight misspelling. These were carefully considered during the analysis process in order to not cause errors or bias in the final results.
The four recommendations are again stated below, but please refer to them above for greater detail on each.
1) Recommendation 1: Alter the "Subscriber" Subscription to Only Work on Weekdays and Create Discounted Weekend Rides for These Users
2) Offer Discounts to the "Customer" Users on a Subscription with Bay Wheels
3) Alter Pricing at Specific Times of Day
4) Create a New Offer on Bikes at Stagnant Stations