Location-Based Social Networking Data: Exploration into Use of Doubly Constrained Gravity Model for Origin-Destination Estimation

Trip distribution is an invaluable portion of the transportation planning process; this distribution leads to the creation of origin–destination (O-D) matrices. Location-based social networking (LBSN) has increased in popularity and sophistication and has emerged as a new travel demand data source. Users of LBSN provide location-sensitive data interactively with mobile devices, including smartphones and tablets. These data can provide O-D estimates with significantly higher temporal resolution at a much lower cost in comparison with traditional methods. An LBSN O-D estimation model based on the doubly constrained gravity model was proposed to improve a previously proposed model based on the singly constrained gravity model. The proposed methodology was calibrated and comparatively evaluated against the O-D matrix generated by the method based on the singly constrained gravity model as well as a reference matrix from the local metropolitan planning organization. The results of this method illustrate significant improvement in reducing the O-D estimation errors caused by the sampling bias from the method based on the singly constrained gravity model.

Trip distribution is a significant step in the transportation planning process; it generates the static or dynamic origin-destination (O-D) trip patterns to be used by traffic assignment models. The existing data collection methods for trip distribution can be classified into three main categories: survey based, traffic count based, and positioning technology based. Survey-based methods such as telephone, in-person interview, mail, or e-mail surveys can collect complete sociodemographic information on travelers and households as well as trip information. These methods are time-consuming and labor intensive and can only generate static travel demand information at low frequency (e.g., every 5 to 10 years) because of funding and resource limitations. Methods based on traffic counts calibrate an O-D matrix based on traffic detector data (1-5). These methods are relatively low in cost and can provide dynamic O-D information if calibrated properly. However, they require an existing O-D matrix and rely on traffic assignment models to generate accurate traffic flow data to be compared with field detector data in model calibration. The use of positioning technologies for O-D data collection has the potential of producing O-D data with much higher spatial and temporal resolution, larger sample size, and lower cost than survey-based methods (6-8). The complications lie in (a) the penetration rate of a specific positioning technology in the mobile devices of travelers, (b) privacy protection, and (c) the uncertainty in determining trip purposes and destinations because of positioning errors.
In the United States, the affordability and accessibility attributed to recent technological advances have allowed smartphones and tablets with location-based service features to be available to individuals of diverse income levels. This development, in conjunction with the fast expansion of social networks, attracts a substantial number of users active in relaying their personal activities online, often including their locations. Location-based social networking (LBSN) combines the aspects of social networking with location-based service features; LBSN provides some advantages over other positioning technologies (9). Users provide trip purpose and destination information through applications with built-in GPS by checking in at particular venues. The sample provided from this methodology has the potential to be larger than that of other methods because of the penetration rate of social networking services, which are growing at a rapid pace. Furthermore, the lack of auxiliary data collection devices and the availability of real-time updated data make this method of data collection an attractive low-cost option. In a previous study, a method based on a singly constrained gravity model was proposed and evaluated (10). Although the study revealed promising potential of LBSN data for O-D estimation, the proposed model still has some limitations, especially the significant bias in O-D patterns related to short-distance trips and residential areas.
A method based on a doubly constrained gravity model is proposed here. Compared with the singly constrained method, its improved learning capability during model calibration can reduce the sampling bias of LBSN check-in data.

State-of-Practice Review on O-D Estimation
Conventionally, O-D matrices are derived by expanding sample O-D matrices collected from traditional household travel behavior surveys based on sociodemographic and economic data for a planning area.

These survey methods include personal home interviews, telephone interviews, mail surveys, and Internet surveys. Personal home interviews are one of the most complete data sources with the highest response rate, 60% to 70%, when compared with other household survey methods (6). Home interviews are the most expensive and time-consuming method, whereas telephone, mail, and Internet-based surveys involve significantly less cost and time. This reduction in time and cost comes with a decrease in response rate and introduces sampling bias. In recent years, travel surveys assisted by GPS have become popular both in the United States and internationally (11). However, significant incentives as well as logistical issues, including battery outages leading to incomplete data and loss of GPS units, have been identified. Moreover, studies have shown that participants may be burdened by the extended length of GPS surveys, equipment complications, and privacy concerns (11), and samples are often biased (7).
Traffic counts have been implemented as a data collection method for use in O-D matrices. Studies have shown that O-D matrix creation is possible given traffic volumes for each transportation link (1-5,12). However, many different matrices can be reproduced from observed traffic counts, and deployment of a comprehensive detector infrastructure on all viable routes would be required. In addition, concerns about the accuracy of estimated traffic conditions between fixed detectors have been discussed in recent research efforts (13). A complementary survey method for the traffic count-based method is the roadside intercept survey, which provides additional information regarding the O-D composition of traffic flow in a road section.
In recent years, secondary planning data sources, such as GPS, cell phone, and Bluetooth, have caught researchers' attention. As opposed to the aforementioned GPS-based survey, recent research efforts have demonstrated the feasibility of replacing traditional survey methods with O-D data directly derived from GPS trajectories generated by travelers' in-vehicle devices (6,8). Cell phones have been explored for their data collection capabilities through their employment of wireless location technologies. Studies have shown that cell phone technologies are both theoretically and experimentally feasible with reasonably precise estimation results (14,15). Penetration rates needed to achieve the spatiotemporal coverage of a network are between 2% and 3% (16). There are limitations with the technology. The spatial resolution of the cell phone positions may be within a cellular cell or location area that may include multiple travel analysis zones (TAZs). The method based on location-based service can significantly increase the spatial resolution, but users may not turn on the location-based service function or report their location-based service data because of privacy concerns. Recently, Bluetooth has been noted to be a low-cost and user-friendly method for data collection (17-19). Use of a unique media access control address assigned by each device's manufacturer alleviates privacy concerns affiliated with other methods of data collection. However, the technology is limited by the short ping cycle, which could lead to oversampling of devices; the potential for a single vehicle to have multiple Bluetooth-capable devices; and the ability to turn off Bluetooth functions in a device. In addition, the variability of Bluetooth samples could yield objectionable expansion errors; this problem negates the technology's ability to independently create an estimate for an O-D matrix (20).
Currently, research is being conducted to determine the ability for "big," vehicle-to-infrastructure, and smartphone data to be used for O-D matrix creation. Big data include transactional (i.e., credit card purchases and payment records and product and service logs), interactional, and observational data (21). Although these data sources have great potential, there are limitations to their incorporation into transportation planning, specifically with the ability for data to be shared. Also, data capture, management, and storage pose potential difficulty with utilization, and biases may exist. Similar to the credit card data mentioned within big data, transactional data have been researched for potential use to improve transit planning (22). Utsunomiya et al. noted that concerns about market penetration, sampling bias, and privacy, as well as errors in transaction and routing assignment, exist with the method (22). Vehicle-to-infrastructure technology has recently been explored as a potential new data source; research indicates that the use of dedicated short-range communications connecting vehicles to infrastructure would have the potential to collect data on every vehicle in the system and effectively eliminate the need for an estimated O-D matrix (23). With the exception of vehicle-to-infrastructure test beds, this method of data collection is not currently viable, and privacy concerns would need to be addressed before its acceptance.

Location-Based Service and Austin LBSN Data Characteristics
Location-based services are services that use location and time data as a control feature. This feature has been encompassed within social networking to create LBSN. With the increased popularity of sites like Facebook, Twitter, and Foursquare, which include LBSN, this form of data collection has been explored recently for comprehension of user spatial patterns. The first study exploring this area was by Li and Chen, who used the Markov-based location predictor to determine future locations of users with an accuracy of 49% (24). The relationships between geographic movements, the temporal dynamics of human movements, and social networking ties were investigated in various studies (9,25,26). In addition, studies by Backstrom et al. and Cheng et al. demonstrated the ability to predict user locations via the user's friends and content, respectively (27,28).
Many social networking sites have added features that allow users to check in to a place of interest, which is called a venue. This capability allows individuals to share and save places that have been visited with fellow users and friends. Foursquare is the most popular site that includes this feature, and as of January 2013 it had more than 30 million users worldwide with over 3 billion check-ins. Users of this particular site include businesses, which encourage check-ins through promotions and discounts. Because of the site's popularity, high penetration rate, and large sample size, researchers have used the available LBSN data to investigate mobility patterns across spatial, temporal, and social aspects (29,30).
Yang et al. were among the first to use Foursquare data to specifically estimate an O-D matrix (10). This study examined noncommuting trips in the Chicago urban area and demonstrated the promising potential of the methodology. Jin et al. furthered this effort by examining the use of check-in data to analyze the O-D demand for Austin, Texas, with a singly constrained gravity model with a two-regime friction factor; the study illustrated the potential of LBSN data for travel demand analysis and monitoring (31). The detailed LBSN O-D estimation model based on a singly constrained gravity model is defined in terms of the following quantities: P_i = productions for zone i; p = 1, ..., P indicates the pth venue type; A_j = attractions for zone j; γ = adjustment factor to zonal trip production from Foursquare check-in counts; ε = adjustment factor to zonal trip attractions for Foursquare check-in counts; (1/N) Σ_i (γ − ε) x_i = residual term for zone i that ensures that total production equals total attraction; T_ij = number of trips between origin zone i and destination zone j; and F(d_ij) = friction function, where d_ij is the travel cost between zones i and j.
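As a hedged illustration of the definitions above, the following sketch computes zonal productions and attractions from check-in counts under one plausible reading: productions scale the check-ins by γ, attractions scale them by ε, and the uniform residual (1/N) Σ_i (γ − ε) x_i is added to every zone's attractions so that the totals match. The function name, the placement of the residual on the attraction side, the omission of venue types, and the sample numbers are all assumptions, not the authors' exact formulation.

```python
import numpy as np

def productions_attractions_uniform(x, gamma, eps):
    """Zonal productions P and attractions A from check-in counts x.

    Assumption (not confirmed by the source): the residual
    (1/N) * sum_i (gamma - eps) * x_i is added to every zone's
    attractions so that total attractions equal total productions.
    """
    x = np.asarray(x, dtype=float)
    n_zones = x.size
    P = gamma * x
    residual = (gamma - eps) * x.sum() / n_zones  # same value for every zone
    A = eps * x + residual
    return P, A

# Hypothetical check-in counts for four zones.
x = [120.0, 40.0, 0.0, 840.0]
P, A = productions_attractions_uniform(x, gamma=2.0, eps=1.5)
print(P.sum(), A.sum())  # totals match by construction
```

Note that a zone with no check-ins still receives the same residual as a zone with many, which is the uniform behavior the residual term describes.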
For the calibration of the model, a trip-balancing process is applied iteratively, with n denoting a step in the iterative process.
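The trip-balancing idea can be sketched with the textbook form of the singly constrained gravity model, in which each row of the trip table is forced to sum to the zonal production and the working attractions are rescaled at each step n until the column sums approach the target attractions. This is a generic scheme under stated assumptions (function names, the exponential friction values, and the sample zones are hypothetical), not necessarily the exact update used in the study.

```python
import numpy as np

def singly_constrained(P, A, F):
    """T[i, j] = P[i] * A[j] * F[i, j] / sum_k (A[k] * F[i, k])."""
    w = A[None, :] * F
    return P[:, None] * w / w.sum(axis=1, keepdims=True)

def balance_attractions(P, A_target, F, n_iter=200):
    """Rescale the working attractions so that the column sums of T
    approach A_target; requires sum(P) == sum(A_target)."""
    A = np.asarray(A_target, dtype=float).copy()
    for _ in range(n_iter):
        T = singly_constrained(P, A, F)
        A = A * A_target / T.sum(axis=0)  # step n -> n + 1
    return singly_constrained(P, A, F), A

# Hypothetical three-zone example with an exponential friction function.
P = np.array([100.0, 200.0, 300.0])
A_target = np.array([250.0, 150.0, 200.0])
idx = np.arange(3)
F = np.exp(-0.5 * np.abs(idx[:, None] - idx[None, :]))
T, A_work = balance_attractions(P, A_target, F)
```

Row sums equal the productions at every step by construction; only the column sums need iteration, which is why only the attraction side is tuned in this scheme.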

Methodology
The proposed model attempts to address several limitations of the previous model. First, the zonal productions and attractions generated by the previous model usually result in symmetric patterns because of the uniform distribution of the residual term among all zones (see Equation 2). Second, the singly constrained model only tunes the zonal attractions. Third, the convergence rate of the singly constrained gravity model is relatively slow, which makes model calibration slow and liable to stop prematurely. To address those limitations, a new model based on a doubly constrained gravity model and zone-specific residual assignment is proposed, with the following additional terms: ρ = power of location factor, β_i = balancing factor for productions, α_j = balancing factor for attractions, and x_i^ρ / Σ_j x_j^ρ = factor to redistribute the residual on the basis of zonal check-in counts for zone i.
In this way, the residuals are assigned on the basis of check-in intensity rather than distributed evenly among all zones. The initial values of P_i and A_j are calculated directly from the Foursquare check-in counts on the basis of Equations 6 and 7. T_ij is then calculated from P_i and A_j on the basis of Equation 8.
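The zone-specific residual assignment can be illustrated with a short sketch. It assumes that the total residual Σ_i (γ − ε) x_i is split among zones in proportion to x_i^ρ and added to the attractions; the function names, this placement of the residual, and the restriction to ρ ≥ 0 are assumptions made for illustration only.

```python
import numpy as np

def residual_weights(x, rho):
    """Share of the residual given to each zone: x_i**rho / sum_j x_j**rho.
    Assumes rho >= 0 so that zones with zero check-ins are well defined."""
    w = np.asarray(x, dtype=float) ** rho
    return w / w.sum()

def productions_attractions_weighted(x, gamma, eps, rho):
    """Hypothetical reading: P = gamma * x, and A = eps * x plus a
    check-in-weighted share of the total residual."""
    x = np.asarray(x, dtype=float)
    total_residual = (gamma - eps) * x.sum()
    P = gamma * x
    A = eps * x + total_residual * residual_weights(x, rho)
    return P, A

x = [120.0, 40.0, 0.0, 840.0]  # hypothetical zonal check-in counts
P, A = productions_attractions_weighted(x, gamma=2.0, eps=1.5, rho=0.5)
```

With ρ = 0 the weights reduce to 1/N, which recovers the even distribution of the earlier model; larger values of ρ concentrate the residual in zones with heavy check-in activity.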
The doubly constrained model can be calibrated by iteratively updating α_j and β_i. In this study, both factors are initialized to 1, that is, α_j^(0) = 1 and β_i^(0) = 1, and are then updated at each iteration.
For consistency, the doubly constrained model used the same friction function that provided the best coincidence ratio (CR) in the singly constrained model. The CR measures the percentage of the area that coincides for the two curves or distributions that are being compared (32). The friction function combines a linear model for short trips and a negative exponential model for long trips; its parameters θ, λ, µ, ρ, and σ were optimized through the genetic optimization algorithm, and d_ij is the Manhattan distance in miles between the centroids of origin zone i and destination zone j. The dual-regime formulation is used to capture the special treatment by the Texas Capital Area Metropolitan Planning Organization (CAMPO) of O-D pairs with short distances.
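A minimal sketch of this calibration loop follows, using the standard balancing-factor (Furness) iteration for a doubly constrained gravity model together with a two-regime friction function. The piecewise form of the friction function and its parameter names (`intercept`, `slope`, `threshold`, `scale`, `decay`) are hypothetical stand-ins; the mapping of θ, λ, µ, ρ, and σ onto these roles is not specified here, and the sample centroids and values are invented.

```python
import numpy as np

def manhattan_distances(centroids):
    """Pairwise Manhattan distance between zone centroids (n x 2 array)."""
    c = np.asarray(centroids, dtype=float)
    return np.abs(c[:, None, :] - c[None, :, :]).sum(axis=2)

def dual_regime_friction(d, intercept, slope, threshold, scale, decay):
    """Linear for short trips, negative exponential for long trips."""
    d = np.asarray(d, dtype=float)
    return np.where(d <= threshold,
                    intercept + slope * d,
                    scale * np.exp(-decay * d))

def doubly_constrained(P, A, F, n_iter=500, tol=1e-10):
    """T[i, j] = beta[i] * P[i] * alpha[j] * A[j] * F[i, j],
    with alpha and beta both initialized to 1."""
    alpha = np.ones_like(A, dtype=float)
    beta = np.ones_like(P, dtype=float)
    for _ in range(n_iter):
        beta_new = 1.0 / (F @ (alpha * A))         # production balancing
        alpha_new = 1.0 / (F.T @ (beta_new * P))   # attraction balancing
        done = (np.abs(alpha_new - alpha).max() < tol
                and np.abs(beta_new - beta).max() < tol)
        alpha, beta = alpha_new, beta_new
        if done:
            break
    T = (beta * P)[:, None] * (alpha * A)[None, :] * F
    return T, alpha, beta

# Hypothetical three-zone example (centroid coordinates in miles).
D = manhattan_distances([[0.0, 0.0], [1.0, 2.0], [4.0, 1.0]])
F = dual_regime_friction(D, intercept=1.0, slope=-0.1,
                         threshold=2.0, scale=1.5, decay=0.3)
P = np.array([100.0, 200.0, 300.0])
A = np.array([250.0, 150.0, 200.0])
T, alpha, beta = doubly_constrained(P, A, F)
```

Unlike the singly constrained case, both the row sums and the column sums of T are matched to their targets, which requires that total productions equal total attractions.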

Study Area
The city of Austin, Texas, was selected as the study area. Austin is a diverse city that encompasses an area of 272 mi² and has an estimated population of almost 1 million. The city of Austin (33) was demographically compared with U.S. Foursquare users (34) as well as the general U.S. population (35), as shown in Figure 1. Foursquare users include a higher proportion of individuals between the ages of 25 and 54; this segment constitutes 80% of the site's users. This age group is also more heavily represented than it is in Austin and in the United States. In addition, there are significantly more female users of Foursquare (65% women compared with 35% men); this distribution is also notably different from the distribution of gender in Austin and the United States. Examination of the educational and income trends of Foursquare users reveals an overrepresentation in the income categories of $25,000 through $74,999 as well as in the Some College category as compared with the Austin and U.S. data. Finally, Foursquare prohibits users under the age of 13; this restriction is reflected in the percentages of users in the 17 and under and Less Than High School categories. The foregoing potential sampling bias needs to be properly addressed when the number of Foursquare check-ins is converted to trip counts.
CAMPO has identified 520 TAZs in the jurisdiction of the City of Austin; this area will serve as the study area for this paper. CAMPO's 2005 travel demand model serves as the reference data used for the analysis. CAMPO data are not considered the ground truth data because of the limitations of current data collection methods. These data serve as reference data for identifying critical empirical O-D patterns. The trip purposes identified in the CAMPO study were combined into eight categories: 1. Home-based work, 2. Home-based nonwork retail, 3. Home-based nonwork other, 4. Home-based nonwork University of Texas, 5. Nonwork airport, 6. Non-home-based work, 7. Non-home-based other, and 8. Non-home-based external.

Data Collection
Foursquare data were collected by first identifying the venues in the study area. Figure 2 shows the 19,710 venues identified in the study area; this distribution demonstrates the spatial coverage of the data. All TAZs with the exception of three had at least one venue, and the majority of venues were located in the denser urban areas.
FIGURE 1 Comparative demographics: (a) age, (b) gender, (c) household income, and (d) education.

Once the venues were identified, a trolling algorithm was used to collect check-ins for the creation of an hourly rate for each venue during the analysis period, Tuesday, June 11, through Tuesday, July 2, 2012. The data collected included the venue ID, venue name, category, latitude, longitude, number of check-ins per hour, and the number of unique users. An initial analysis of the check-ins was performed to verify that a category was assigned to each venue. These categories (Table 1) include Arts and Entertainment, Colleges and Universities, Food, Professional and Other Places, Nightlife Spots, Residences, Great Outdoors, Shops and Services, and Travel and Transport. Since categories are assigned by venue creators and are optional, some venues did not have a category assignment. For those venues, a keyword search was performed when possible to assign the appropriate primary category.
Table 1 provides a categorical breakdown of the number of venues and check-ins collected. Of the 10 venue categories, Shops and Services has the largest percentage of venues, and the lowest percentage is associated with Nightlife Spots. Check-ins are most frequently associated with Shops and Services and Food, which account for 51.3% of all check-ins. The Residences category has the fewest check-ins at 2.7% and a moderately low number of venues in the sample. The average number of check-ins per venue was also calculated for each category; the largest average comes from the Nightlife Spots category (1,224 check-ins) and the smallest comes from the Unclassified category (67 check-ins). The three highest averages were in the Nightlife Spots, Food, and Travel and Transport categories. Although the findings for the Nightlife Spots and Food categories are to be expected since they involve social activities, the large number of Travel and Transport check-ins is unexpected. In addition, because of the low percentage (1.5%) of check-ins for the Unclassified category, it was determined that their removal from the study would not negatively affect the analysis.
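The per-category summary described above (venue counts, share of check-ins, and average check-ins per venue) amounts to a simple aggregation. The sketch below shows one way to compute it; the record layout and the sample values are hypothetical and do not reproduce Table 1.

```python
from collections import defaultdict

def summarize_by_category(venues):
    """venues: iterable of (category, check_ins) pairs, one per venue."""
    totals = defaultdict(lambda: [0, 0])  # category -> [venues, check-ins]
    for category, check_ins in venues:
        totals[category][0] += 1
        totals[category][1] += check_ins
    grand_total = sum(c for _, c in totals.values())
    return {
        cat: {
            "venues": n,
            "share_of_check_ins": c / grand_total,
            "mean_check_ins_per_venue": c / n,
        }
        for cat, (n, c) in totals.items()
    }

# Hypothetical venue records.
sample = [("Food", 300), ("Food", 100),
          ("Nightlife Spots", 500), ("Residences", 100)]
summary = summarize_by_category(sample)
```

A category can have a small share of venues and still the highest average check-ins per venue, as the Nightlife Spots figures above illustrate.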

Model Calibration
For calibration of the proposed model, a genetic algorithm was implemented. This algorithm, run in MATLAB, optimizes by mimicking the principles of biological evolution: it repeatedly modifies a population of individual points by using rules modeled on gene combinations in reproduction. This optimization strategy was selected for the improved chances of finding a global solution because of the algorithm's random nature. At each step, individuals are randomly selected from the current population and used as parents of the children for the next generation. This process is repeated, and the population eventually evolves toward an optimal solution. The genetic algorithm was used to obtain parameters for the friction function and the production and attraction calculations that would in turn minimize the mean absolute error (MAE) between the modeled O-D matrix and the reference CAMPO O-D matrix. The CR is computed from p_w^M, the percentage of trips within trip length interval w in the predicted trips from the check-in data, and p_w^O, the percentage of trips within trip length interval w in the survey trips from CAMPO; trip counts are aggregated by trip length with an aggregation interval of 1 mi (1.609 km).
The value for the CR ranges from 0 when the distributions are completely different to 1 when the distributions are exactly the same. For this study, the higher the CR between the check-in and the CAMPO results for each model, the better the model. Table 2 provides the results from the genetic optimization. In general, the calibrated parameters are similar except for the attraction scaling factor ε and the friction factor function parameters θ and σ. Significant improvement can be observed for the CR and MAE values from the proposed model.
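Both performance measures are straightforward to compute. The sketch below uses the commonly cited definition of the coincidence ratio, the sum over trip length intervals of min(p_w^M, p_w^O) divided by the sum of max(p_w^M, p_w^O), and takes the MAE as the mean absolute cell-by-cell difference between two O-D matrices. Whether the study normalized the MAE in exactly this way is an assumption, and the sample distributions are hypothetical.

```python
import numpy as np

def coincidence_ratio(p_model, p_ref):
    """Sum of interval-wise minima over sum of interval-wise maxima."""
    p_model = np.asarray(p_model, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    return np.minimum(p_model, p_ref).sum() / np.maximum(p_model, p_ref).sum()

def mean_absolute_error(T_model, T_ref):
    """Mean absolute difference over all O-D pairs (assumed normalization)."""
    diff = np.asarray(T_model, dtype=float) - np.asarray(T_ref, dtype=float)
    return np.abs(diff).mean()

# Hypothetical trip length shares for three 1-mi intervals.
p_model = [0.5, 0.3, 0.2]
p_ref = [0.4, 0.4, 0.2]
cr = coincidence_ratio(p_model, p_ref)  # 0.9 / 1.1
mae = mean_absolute_error([[10, 20], [30, 40]], [[12, 18], [30, 44]])
```

Identical distributions give a CR of 1 and distributions with no overlapping intervals give 0, which matches the range described above.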

Experimental Evaluation
A comparison between the calibrated singly constrained, doubly constrained, and CAMPO O-D matrices was done by examining the trip length distributions, the zonal trip production and attraction rates, and the zonal O-D flow patterns.

Trip Length Distributions
As with the CR, trip length distribution curves were examined to illustrate how closely the model output data match the reference data. Figure 3 shows the trip length distributions (a, b) and the cumulative trip length distributions (c, d) for the singly and doubly constrained models compared with the reference CAMPO O-D matrix. Examination of the trip length distribution shows that the doubly constrained model (Figure 3, c and d) is relatively consistent with the general curvature of the reference curve. However, for the singly constrained model (Figure 3, a and b), underestimation occurs for short trips and slight overestimation occurs for long trips. For the singly constrained cumulative distribution (Figure 3b), slight underestimation is consistently shown for the curve. Although the curves do generally follow the same paths, the deviations indicated lend themselves to further fine-tuning of this method.
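A trip length distribution of this kind can be derived from an O-D matrix and a matching distance matrix by binning trips into fixed-width length intervals. The sketch below assumes 1-mi bins, consistent with the aggregation interval used for the CR; the function name and sample matrices are hypothetical.

```python
import numpy as np

def trip_length_distribution(T, D, bin_width=1.0):
    """Share of trips per trip length interval and its cumulative sum.

    T: O-D trip matrix; D: distance matrix of the same shape (miles).
    Interval w covers distances in [w * bin_width, (w + 1) * bin_width).
    """
    T = np.asarray(T, dtype=float)
    D = np.asarray(D, dtype=float)
    idx = np.floor(D / bin_width).astype(int)
    n_bins = idx.max() + 1
    counts = np.bincount(idx.ravel(), weights=T.ravel(), minlength=n_bins)
    share = counts / counts.sum()
    return share, np.cumsum(share)

# Hypothetical two-zone example.
T = [[10, 20], [30, 40]]
D = [[0.5, 2.2], [2.2, 0.5]]
share, cumulative = trip_length_distribution(T, D)
```

Applying the same function to the modeled and reference matrices yields the two curves that are compared interval by interval.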

Zonal Production and Attraction Rates
To assess the validity of the methods used to associate the check-ins with the various venues throughout the study area, heat maps were created that show the productions and attractions for each model. Figure 4a demonstrates where the methodology excels and where there are limitations for the production calculations. With the CAMPO production map as a reference map, the singly constrained model shows high-production areas that are significantly fewer in number. In addition, the singly constrained model shows a midlevel production area through the study region whereas the CAMPO map is more polarized. Conversely, the doubly constrained map shows production rates that are similar in magnitude to those on the CAMPO map through the study region, where TAZs that include the central business district, the airport, and areas dense with living, entertainment, retail, and food venues are consistently depicted as large production generators. Figure 4b provides heat maps for the attractions of each model; the maps highlight where the methodology excels and where there are limitations for the attraction calculations. Once again, with the CAMPO attraction map as the reference, the singly constrained model predicts attraction rates similar to those of the CAMPO model for many areas, but it suffers from the inability to associate high attraction rates with all of the TAZs identified in the CAMPO map. The doubly constrained map demonstrates the model's ability to better identify areas with high attraction rates. However, the map highlights areas where overestimation occurs, namely, in the northwestern portion of the map. Although CAMPO data are used as a reference, the data still have limitations, for example, the high variations in trip frequencies among zones potentially caused by under- or oversampling in certain zones.

Model Matrix Comparison
The next step in the analysis of the methodology was to examine the zonal flow pattern for each model; this pattern can be regarded as the visualization of the O-D matrices. Destination zones are located along the horizontal axis, and origin zones are along the vertical axis (Figure 5). An O-D flow intensity, I_ij, is calculated for each O-D pair. The O-D heat maps discussed next were used to provide an illustration of the distribution of trip intensities among TAZs. Each cell (i, j) in the O-D heat map indicates the I_ij value from zone i to zone j; higher values are illustrated by lighter colors. A light horizontal or vertical band indicates a high production or attraction zone, and light areas indicate heavily interacting zones. A difference diagram is also shown (Figure 5, d and h) to demonstrate how the estimation error is distributed among the different O-D pairs. Overall, the heat maps provide a more detailed description of O-D patterns and error distributions as compared with a single performance measure. In addition, the MAE matrix is provided to demonstrate how closely the estimated Foursquare matrix matches the CAMPO matrix.

The feasibility of using LBSN data to analyze the urban travel demand pattern with a doubly constrained gravity model is investigated. Check-in data from Foursquare, a leading LBSN provider, were used to create production and attraction rates for the singly and doubly constrained gravity models, which were used in conjunction with the CAMPO O-D matrix to examine the predictability of the proposed methodology.

In comparison with the traditional methods used for O-D estimation, this study shows that LBSN data have potential. LBSN data are a low-cost option for updating the O-D matrix since the only cost comes from purchasing the historical data from Foursquare, Twitter, or other LBSN data vendors. The O-D matrix can be updated annually, monthly, or weekly depending on the requirements of the metropolitan planning organization. Compared with the prevailing secondary data methods based on GPS, Bluetooth, and cell phones, the LBSN data have user-confirmed trip purposes and destinations; this advantage eliminates the need for conducting reverse geocoding and recurrent trip pattern recognition. Furthermore, because of its intensive spatial and temporal coverage, LBSN data have the potential to become a promising dynamic travel demand data source for active traffic and demand management solutions (36).
LBSN data also have their bias for different venue types (e.g., residential areas, recreational locations, and tourist attractions). In comparison with the method of the existing singly constrained gravity model, the proposed method based on the doubly constrained model demonstrates better learning capabilities. There are some limitations with the proposed methodology that should be examined in future research. The model results still indicate some geographical bias for tourist regions (i.e., the northwest region in Figure 4) and residential areas. In addition, the estimated O-D matrix still has significant error for intrazonal trips (the diagonal in Figure 5). Further examination into the temporal aspects of the models as well as specific trip purposes should be done to further validate this proposed methodology.

Acknowledgments
The authors thank Foursquare for allowing the research team to obtain data through the programming interface of their developer application and the Capital Area Metropolitan Planning Organization for providing the 2010 O-D data used in this study.