Continue to Site

Eng-Tips is the largest engineering community on the Internet

Intelligent Work Forums for Engineering Professionals

  • Congratulations cowski on being selected by the Eng-Tips community for having the most helpful posts in the forums last week. Way to Go!

Cleaning data in excel (traffic counts)

Status
Not open for further replies.

mralbuto

Civil/Environmental
Mar 15, 2013
2
I have data of daily vehicle counts in 5 minute increments of a section of road for an entire year. The problem is the data has lots of missing entries done purposefully. I am tasked with using the functions of excel to identify the errors, counting how many exist and then removing them. To do this in excel i am clueless. I guess what I am really asking is how do I firstly define what is an error (missing entry) and then how do i count them?

Any help would be greatly appreciated
 
Replies continue below

Recommended for you

You've already defined error = missing. What else do you expect us to say?

If data is missing on purpose, how is that an error?

How do you expect to remove data that isn't there in the first place?

Why count something that you are going to remove?
 
mralbuto, please do these things:

Offer more complete explanation - MintJulep's questions are valid I think. It helps us a lot to know the background so that we know what your real goal is.

Post a small section of your spreadsheet that illustrates the format (with headings) and contains an error. Within the spreadsheet, highlight the error.

Explain to us how you know the highlighted error is, in fact, an error. We don't know if it's a blank line, a sequence of times that skip a 5-minute increment, or what?

Read the forum policies, especially number 4. Posting the same question in more than one forum causes confusion. Having two active threads on the same topic prevents us from seeing the complete train of thought. There is a link to the forum policies below my signature in this post.

Here is a link to your other thread, I suggest you red-flag it for deletion. This is a more appropriate place for your question I think.



Best to you,

Goober Dave

Haven't see the forum policies? Do so now: Forum Policies
 
I assume the errors are the blank spaces and the entire missing dates? Let us know.

I see blank spaces in lots of columns and there must be 40 dates missing since your data spans only 325 columns from 1/1 to 12/31. I happened to see that 3/23 and 3/24 are missing (entire days). I'm guessing that you have to count missing days separately from missing values in a column? Please clarify.

You can find blank spaces and add them with a simple if and sum to get a total. (Note: I'm not the most fluent Excel guy. Someone else may have a much more elegant solution.)

You might put in LO4 something like =if(b4="",1,"")

Then copy it 325 columns to the left and however many rows you have down. Then sum the new array for a total.

Does "remove them" mean to fill the blank spaces with something? Delete the blank spaces and move the good data below them up to fill in? Or does it mean something else?

Explain "remove them" in detail, and what to do with missing dates versus missing data on a single date, and you'll get some better help perhaps.

Best to you,

Goober Dave

Haven't see the forum policies? Do so now: Forum Policies
 
What is your ultimate goal? Do you really need to "remove" the bad data? (odd since it already is not there). Or do you need to perform calculations that don't include the bad data?

Statistical functions in Excel do not appear to include the blank values. I have attached a revised spreadsheet with the following changes:
First column calculates the average for the time period. (Excel function AVERAGE)
Second column calculates the standard deviation for the time period. (Excel function STDEV.P)
Third column counts the number of non-blank entries. (Excel function COUNT)
Fourth column counts the number of blank entries. (Excel function COUNTBLANK)
At the top left of the sheet I have calculated the total number of Blanks.
 
 http://files.engineering.com/getfile.aspx?folder=1ca7b83e-42f4-433b-96c0-c39168115f12&file=Assignment1-UncleanedDataset_REV.xlsx
You can use the COUNT(range) function to find how many values (non-blank) are in each column and hence the number of missing values.
The data can be sorted to group all blanks together for easy deletion.
TerryScan has probably given a more complete explanation.
 
I think the original data is in another format- translation error have left blanks so the goal is to get a summary of the time stamps of the unrecovered data so that data technicians can be tasked with recovering data when determined to be worth the effort and given that total file is humongous then manual searching for these gaps is too much. I did something like this semimanually.
 
If you sort your data in Excel, it will group all of your data with blank cells (as described above). The COUNTBLANK function can show you how many blank cells you have. If Excel is too slow, you might do some command line work to get at the information you're after.

If the data is in text format to start with, you may find it more efficient to use sed delete all of the lines except the ones you're interested in. If the data is in CSV format, for example, I think the following will delete lines without blank entries, send them to newfile1:
Code:
cat datafile.csv|sed '/,,/!d' newfile1
Or deleting lines with blank entries, sending them to newfile2:
Code:
cat datafile.csv|sed '/,,/d' newfile2
You can then count the lines without blank entries using wc:
Code:
wc -l newfile1
Or with blank entries:
Code:
wc -l newfile2

Easy peasy.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor