Overview
My project is based on data on natural gas usage by households in NYC. Its main purpose is to determine whether average gas usage varies by borough and, if it does, to display that trend in a meaningful way. The problem to solve is whether gas usage differs by location, for example between Manhattan and the other boroughs. The solution is to use Python to clean the data and perform exploratory data analysis to identify trends among gas usage, location, and population. This is relevant to NYC because its boroughs are constantly compared and contrasted with each other, and it would be interesting to see whether there are any significant differences in their gas usage.
​
My underlying hypothesis is that the borough with the most residential households and the largest population will have the highest natural gas usage. I also predict that gas usage per person will be the same regardless of the borough.
​
The datasets I used are:
-
https://data.ny.gov/Government-Finance/New-York-State-ZIP-Codes-County-FIPS-Cross-Referen/juva-r6g2/data . This dataset gives the borough (county) of each ZIP code in NYC.
​
-
https://data.cityofnewyork.us/w/uedp-fegm/25te-f2tw?cur=zh_ETNBDGVA&from=root . This dataset shows the gas usage of households and industries in 2010.
​
-
https://data.ny.gov/widgets/xywu-7bv9 . This dataset shows the population of each borough.
​
I used Python 3 for all data cleaning, analysis, and visualization. The libraries used are common data-analysis libraries: pandas, pandasql, and matplotlib.pyplot. I used them to clean and filter the datasets, perform exploratory data analysis, and visualize the results of my analysis.
Data Section
The datasets I used are:
-
https://data.ny.gov/Government-Finance/New-York-State-ZIP-Codes-County-FIPS-Cross-Referen/juva-r6g2/data . This dataset gives the borough (county) of each ZIP code in NYC. It contains the columns County Name, State FIPS, County Code, County FIPS, ZIP Code, and file data. The columns needed for this project are County Name and ZIP Code. Each row pairs a ZIP code with its county, so the ZIP codes can be grouped by county for further analysis.
-
https://data.cityofnewyork.us/w/uedp-fegm/25te-f2tw?cur=zh_ETNBDGVA&from=root . This dataset shows the total natural gas consumption (in therms or GJ) of households and industries in each ZIP code in NYC during 2010. From it, the average consumption of each ZIP code can be calculated, and each ZIP code can then be assigned to its borough. This dataset is integral to the project because it contains all the gas usage information used for further analysis and visualization.
​
-
https://data.ny.gov/widgets/xywu-7bv9 . This dataset shows the total population of each borough from 1950 to 2040. Populations after 2020 are projections rather than known numbers; this does not affect the project, since I use only the 2010 population data. This dataset is used to compute the natural gas consumption of an average person in each borough.
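As a rough illustration of how the population data could feed the per-person figure, here is a minimal sketch. It assumes that per-person usage means a borough's total consumption divided by its 2010 population, which may differ from the exact calculation used in the project. The column names (borough, total_therms, population_2010), the function name, and all values are hypothetical placeholders, not taken from the real datasets.

```python
import pandas as pd

# Toy placeholder values for illustration only, NOT figures from the real datasets.
borough_gas = pd.DataFrame({
    "borough": ["A", "B"],
    "total_therms": [1000.0, 3000.0],
})
borough_pop = pd.DataFrame({
    "borough": ["A", "B"],
    "population_2010": [100, 200],
})


def therms_per_person(gas: pd.DataFrame, pop: pd.DataFrame) -> pd.DataFrame:
    """Join gas totals to population and divide (assumed definition of per-person usage)."""
    merged = gas.merge(pop, on="borough", how="inner")
    merged["therms_per_person"] = merged["total_therms"] / merged["population_2010"]
    return merged


per_capita = therms_per_person(borough_gas, borough_pop)
print(per_capita)
```

With the real data, the two toy frames would be replaced by the borough-level gas totals and the 2010 column of the population dataset.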
Techniques Section
-
First, I used pandas to load the datasets that needed cleaning and filtering. For the gas usage dataset, I used pandas to keep only the residential, small residential, and large residential building types, since the project focuses on household usage, not institutions. Then I used pandasql to query the data frame and average the gas usage of each distinct ZIP code. I also used pandas to drop the columns of the population dataset that the project does not need. Only the 2010 population data was kept because the gas usage data is for 2010.
​
-
Next, I used pandas to find the 5 ZIP codes with the largest average gas consumption and the 5 with the smallest. I then filtered the county/borough dataset to keep only the county name and the ZIP code, so that I could perform an inner merge between the per-ZIP-code gas usage data and the county/borough dataset on the 'ZIP Code' column. To find the average gas usage per borough, I used pandasql to query the merged data frame and group by county name. I continued to use pandas for the remaining calculations.
​
-
Lastly, I used matplotlib.pyplot to create the visualizations for this project. It was used to graph the results of the exploratory data analysis: the average gas usage of each borough, the gas usage of an average person in each borough, the population of each borough, the 5 ZIP codes with the largest average gas consumption, and the 5 ZIP codes with the smallest average gas consumption.
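The cleaning and aggregation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses pandas groupby in place of the pandasql queries used in the project, the gas dataset column names (zip_code, building_type, therms) and the exact building type labels are hypothetical, and the rows are toy values rather than real data. Only 'County Name' and 'ZIP Code' come from the dataset description above.

```python
import pandas as pd

# Assumed labels for the three residential building types (hypothetical spelling).
RESIDENTIAL_TYPES = ["Residential", "Small residential", "Large residential"]

# Toy placeholder rows for illustration only, NOT values from the real datasets.
gas = pd.DataFrame({
    "zip_code": [10001, 10001, 10002, 11201, 11201, 10451],
    "building_type": ["Residential", "Commercial", "Small residential",
                      "Large residential", "Residential", "Institutional"],
    "therms": [100.0, 900.0, 300.0, 200.0, 600.0, 500.0],
})
zip_county = pd.DataFrame({
    "County Name": ["New York", "New York", "Kings", "Bronx"],
    "State FIPS": [36, 36, 36, 36],
    "ZIP Code": [10001, 10002, 11201, 10451],
})


def average_by_zip(gas_df: pd.DataFrame) -> pd.DataFrame:
    """Keep residential rows only, then average consumption per ZIP code."""
    residential = gas_df[gas_df["building_type"].isin(RESIDENTIAL_TYPES)]
    averaged = residential.groupby("zip_code", as_index=False)["therms"].mean()
    return averaged.rename(columns={"zip_code": "ZIP Code", "therms": "avg_therms"})


def average_by_county(zip_avg_df: pd.DataFrame, county_df: pd.DataFrame) -> pd.DataFrame:
    """Inner-merge on 'ZIP Code', then average the ZIP-level means per county."""
    lookup = county_df[["County Name", "ZIP Code"]]  # drop columns the project does not need
    merged = zip_avg_df.merge(lookup, on="ZIP Code", how="inner")
    return merged.groupby("County Name", as_index=False)["avg_therms"].mean()


zip_avg = average_by_zip(gas)
county_avg = average_by_county(zip_avg, zip_county)

# The project uses n=5; n=2 here because the toy data has only three ZIP codes.
top = zip_avg.nlargest(2, "avg_therms")
bottom = zip_avg.nsmallest(2, "avg_therms")
print(county_avg)
```

Because the merge is an inner join, a ZIP code with no residential rows (10451 in the toy data) drops out of the borough averages, which is worth keeping in mind when reading the per-borough results.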