INTELLIGENT WORK FORUMS
FOR ENGINEERING PROFESSIONALS

Log In

Come Join Us!

Are you an
Engineering professional?
Join Eng-Tips Forums!
  • Talk With Other Members
  • Be Notified Of Responses
    To Your Posts
  • Keyword Search
  • One-Click Access To Your
    Favorite Forums
  • Automated Signatures
    On Your Posts
  • Best Of All, It's Free!

*Eng-Tips's functionality depends on members receiving e-mail. By joining you are opting in to receive e-mail.

Posting Guidelines

Promoting, selling, recruiting, coursework and thesis posting is forbidden.

Jobs

20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report
2

20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Quote:

Savvy Excel users know all about this.

... and even savvy biologists knew all about it more than 10 years ago:
http://bmcbioinformatics.biomedcentral.com/article...

That link is from the recent paper, which can be read in full at:
http://genomebiology.biomedcentral.com/articles/10...

The graph of the frequency of these errors is quite interesting:


As Skip says, we have discussed this "feature" here before, but to blame it all on Excel is ridiculous. Aren't these papers supposed to be peer reviewed? Do they blame unit conversion errors, or misplaced bracket errors on Excel as well?

Anyway, the solution (which apparently was pretty well known in 2005, and has since been forgotten) can be found in the 2004 paper:

Quote:


There are a number of work-arounds to these behaviors in Excel, but all of them require continued attention on the part of the user to avoid introducing errors. The appropriate solutions depend on the context in which Excel is opened:

If Excel is configured to open a test file automatically from another application, then the data must be pre-processed in the upstream application. For example, a space character or an apostrophe can be placed in front of the gene name. That is the solution implemented in the Excel output format option of MatchMiner [1] and the primary approach recommended by Microsoft in their Knowledge Base Article on the issue [6].

If a text file is to be opened by Excel, open Excel first and then select the text file to read. Then select "text" mode for the column(s) containing potentially affected symbols in the Text Import Wizard Step 3 of 3.

If text is to be copied from another application (such as a text processor) and pasted into a pre-opened Excel spreadsheet, the formatting must be set in the spreadsheet. Within the pre-opened spreadsheet, prior to pasting, use Format -> cells to specify which columns of the recipient spreadsheet are to be treated as text [6]. That procedure works for copying from several text processors tested on Mac OS 10.2. However, changing the format of the spreadsheet column to text fails to solve the conversion problem when pasting data from a Microsoft Word file. In that case, in addition to the formatting, use the Paste Special -> Paste: As: Text command to insert the text.

But I don't agree with their next statement:

Quote:

Despite the work-arounds, even the most vigilant investigator can inadvertently introduce conversion errors, and it is often necessary to screen data received from other sources.

How can a research scientist who lets errors as obvious as these to sslip through into published work be described as "vigilant"?

Doug Jenkins
Interactive Design Services
http://newtonexcelbach.wordpress.com/

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

I agree, not very vigilant.
If we are using a more traditional programming language, it usually forces us to tell it what type of data going into each variable.
An Excel worksheet does not force us to tell it the type of data going into each cell, so we should expect some problems if we don't tell it what type of data (by preformatting).
Thanks skip for the FAQ's that made it pretty simple to grasp.


=====================================
(2B)+(2B)' ?

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Every technical user ought to 1) be aware of and 2) be thinking of the fact that data conversion is happening every time they enter data into a spreadsheet. So it is incumbent upon each of us to verify our process and results, especially when such a tool is used in "mission critical" tasks. Every column of a spreadsheet table ought to have a proper Number Format applied BEFORE the table is used/filled. Structured Tables make the maintenance of such formatting as a table grows in size, seamless.

One of the other areas that often causes problems is numeric IDENTIFIERS. data like Invoice Numbers, Item Numbers, Part Numbers, Postal Numbers, Member IDs (and the list goes on): numeric data on which you will NEVER perform a mathematical operation, are often stored as numbers. WHY? It is asking for trouble that an IDENTIFIER would be entered and stored in a format that IS CONVERTED INTO SOMETHING EVERY TIME, and therefore is susceptible to being converted into something unwanted and is not in itself a meaningful number. Every identifier ought to be entered/stored as TEXT, as described in the FAQ. You might need a work-around when joining to other databases but it's worth the effort.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

In addition to fractions converting to dates, I get plagued by UK-US date conversion. 1/8 becomes 01-August (UK format) but 1/16 becomes 16-January (US format). Somehow, even though everything is UK, I still end up getting US dates. It wouldn't be so bad if I got 02 Jan, 04 Jan, 08 Jan - then I know it is all US format. If I got 01 Feb, 01 Apr, 01 Aug then I know it is UK format but when half converts to UK and the other half to US, it is just painful to sort out. 2016 is a particularly bad year - sometimes, 3/16 converts to 01 mar 2016 and sometimes 1/2 converts to 1 Feb 1900! I now use a macro to check the "fractions"

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

That would be extremely frustrating. WOW!

So are you saying that if you entered those examples on your sheet right now, that there would be absolutely no consistency in the date conversion, entering each of the fractions you mentioned, or does this happen under different circumstances?

 
I'm very interested in this, since I wrote the FAQs that I posted above and would like to understand the possible pitfalls related to data conversion.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

It is when I cut from IE or Acrobat and paste it in Excel. It is particularly bad on the 1/16. Next year it should be OK again. The next problem year will be 2032.

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Well I can't duplicate this anomaly and it may be that I'm using USA format MDY and Bill designed Excel in the state of Washington.

You do realize that if you want to enter fractions you must format the cell/range as TEXT before entering any data. Changing the Number Format after the fact does nothing, because a Number Format is merely a display feature. Having entered a fraction as text, it renders the value virtually useless as a numerical value.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Happens on both 2013 and 2016.

Go to http://russell.htpgusa.com/products/air-cooled-con... get the pdf for small condensers. Copy the table on page 6, paste it in excel, then use text to columns, delimited by spaces. 12-1/2 becomes 12/01/2002, 31-9/16 becomes 31/09/201616! Anything that can be converted to a date will be.

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Well why haven't you used the TEXT Column Data Format on Step 3 of 3 of the Text to Columns wizard?

You're shooting yourself in the foot!

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Happens regardless of whether I choose general or text.

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Well I'm at a loss to explain your experience. I performed

1) copying the table in question,
2) pasting into a cell, which, BTW, resulted in one long string,
3) started the Text to columns wizard,
4) chose Delimited in Step 1,
5) Checked Space in Step 2,
6) in Step 3 FIRST made sure that ALL the columns were selected (darkened) in the data preview and then selected the TEXT Column data format, and finally
7) FINISH.

I suspect that you failed to SELECT the columns to import as TEXT in Step 3 of the Text to columns Wizard.

BTW, in Step 3, selecting all the columns (63) one at a time would be tedious: select the first, scroll to right, SHIFT + SELECT last column to select all.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

try this...

CODE

Sub ParseColumns()
'SkipVought 9/16/2016 Eng-Tips
'copy table page 6 in http://russell.htpgusa.com/images/documentation/products/condensers/RU-RDX-0816A.pdf
'paste into A1

    Range("A1").TextToColumns _
        Destination:=Range("A1"), _
        DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, _
        ConsecutiveDelimiter:=True, _
        Tab:=False, _
        Semicolon:=False, _
        Comma:=False, _
        Space:=True, _
        Other:=False, _
        FieldInfo:=Array(Array(1, 2), Array(2, 2), Array(3, 2), Array(4, 2), Array(5, 2), Array(6, 2), _
        Array(7, 2), Array(8, 2), Array(9, 2), Array(10, 2), Array(11, 2), Array(12, 2), Array(13, 2), _
        Array(14, 2), Array(15, 2), Array(16, 2), Array(17, 2), Array(18, 2), Array(19, 2), Array(20, 2), _
        Array(21, 2), Array(22, 2), Array(23, 2), Array(24, 2), Array(25, 2), Array(26, 2), Array(27, 2), _
        Array(28, 2), Array(29, 2), Array(30, 2), Array(31, 2), Array(32, 2), Array(33, 2), Array(34, 2), _
        Array(35, 2), Array(36, 2), Array(37, 2), Array(38, 2), Array(39, 2), Array(40, 2), Array(41, 2), _
        Array(42, 2), Array(43, 2), Array(44, 2), Array(45, 2), Array(46, 2), Array(47, 2), Array(48, 2), _
        Array(49, 2), Array(50, 2), Array(51, 2), Array(52, 2), Array(53, 2), Array(54, 2), Array(55, 2), _
        Array(56, 2), Array(57, 2), Array(58, 2), Array(59, 2), Array(60, 2), Array(61, 2), Array(62, 2), _
        Array(63, 2)), _
        TrailingMinusNumbers:=True
End Sub 

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

That's exactly what I did. The way I got around it was to preset the columns to fractional format before converting.

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Help me out. Are you stating that if you were to 1) paste in that PDF table to A1 and then 2) run my procedure, that the date conversion still occurs?

Also, you stated, "before converting." Could you explain what it is you were converting?

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Yes - if I paste the PDF table to A1 and then go from step 3 to 7, the fractions get converted to dates.

Sometimes they get converted to US dates, sometimes to UK dates. I didn't use the sub as there are lots of tables from lots of manufacturers and every one is different. Rewriting the sub for every table would be a nightmare. Sometimes, I have to input the table manually because the PDF just contains a bitmap ponder

If I note which columns should be fractions and preset the format of those column to fractions before running steps 3 to 7, everything comes out correctly.

DrawOh - apologies for hijacking your thread

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Have you tried to just copy the PDF table into Word and then copy it into Excel? You may need to do a Paste Special as text (the table at page 6 of http://russell.htpgusa.com/images/documentation/pr... ended up converting the 12-1/2 into a date if you don't use the Paste Special).


RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Quote:

Yes - if I paste the PDF table to A1 and then go from step 3 to 7, the fractions get converted to dates.

You and I are talking about two completely different things it appears. The Text to column wizard, Excel 2013, has THREE steps, not SEVEN. What feature are you using?

This lead me to believe that we are, maybe not worlds apart but at least not referring to the same process. The reason for me posting the macro was to determine if my process was the same as your process. Apparently NOT. I want to understand whats going on, because this is one of the features in Excel that is often misunderstood and misused, as the subject of this thread testifies.

Can, will you continue to pursue this with me? TIA

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Yes!!!

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Quote (Skip)

You and I are talking about two completely different things it appears. The Text to column wizard, Excel 2013, has THREE steps, not SEVEN. What feature are you using?

Skip - your post of 11:55, 16-Sep has 7 steps!

But if I follow the process with the table from the linked pdf in Excel 2016 it works the same as shown by IRStuff above. The only difference from your post is I get 7 rows of data in a single column when I paste, not everything in one cell.

Another point is that if you go through the text to columns process then paste the same data somewhere else, this time it will be pasted in columns, but it doesn't remember the format setting, so the fractions come in as mixed date and text. You have to set the format to text over the whole range of the table before pasting, then it will paste as text in columns without going through the text to columns process. This happens even if you re-copy the data, or open a new workbook and paste there. As far as I can see the only way to make it forget and go back to pasting in a single column by default is to close Excel and re-start.

Doug Jenkins
Interactive Design Services
http://newtonexcelbach.wordpress.com/

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Oops blush. My bad for mixing my steps as numbers with the Text to column wizard which has numbers.

xwb, I am sorry for causing this confusion.

Doug, thanks for pointing this out to me.

I am puzzled why my copy/paste produces one long string without any line breaks when I paste into Excel.

I did paste the table here and then copy/pasted into Excel and that produced 7 cells in one column, from which I performed the Text to columns.

But I'm still very interested in understanding why the process fails for xwb. xwb, if you would run my procedure and report the results, it would help my understanding.

BTW, before pasting your table into Excel, 1) ENTER any value into A1, 2) SELECT A1, start the Text to columns wizard and DESELECT Space in Step 2 of the DELIMITED option and FINISH.

Then paste in your table in A1. Now you're ready to do the parsing.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

The string problem in Excel is because it doesn't recognize the newline character that the pdf uses. Other programs, like this browser, or notepad++, are able to interpret the string as we see it in the pdf. The converse is often required. I often copy article titles from pdfs to use as their file names, and windows explorer does recognize the new lines, and I have to paste into Chrome or Internet Explorer to get rid of the new line break.

TTFN (ta ta for now)
I can do absolutely anything. I'm an expert! https://www.youtube.com/watch?v=BKorP55Aqvg
FAQ731-376: Eng-Tips.com Forum Policies forum1529: Translation Assistance for Engineers

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

It is probably a UK-US locale-date problem. I guess you may get a similar problem with Japanese/Chinese style dates which are yy/mm/dd. You won't get that many problems running from a US locale.

I have found that, even with macros, sometimes Excel just ignores the locale and assumes US until the 13th, then it switches to UK. I've had macro generated spreadsheets which copy from IE where the first 12 days of the month get switched to US format and the rest come out in UK format. The only way I've found that always works is to preset the format of the column before writing to it.

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Still trying to understand what's going on.

Here is a list of various ways I have manually entered 12-1/2, which appears to represent a numerical value 12 1/2 or 12.5 in the subject table. In the case of entering an intended fraction I entered 12 1/2 (with a SPACE)


Preset	Entered	Display 	Actual
Format	Value	Value   	Value
Fract'n	12-1/2	12/1/2002	37591
Fract'n	12 1/2	12 1/2   	12.5
Text	12-1/2	12-1/2  	12-1/2
 

The only preset number format that "sticks" is TEXT. Presetting the format as Fraction is overridden by Excel's 'rules' for date entry/conversion. 12-1/2 or 12/1-2 will both result in a date regardless of presetting the cell number format.

Can anyone else, on the east side of the pond, enter 12-1/2 to result in a fraction other than presetting the format to TEXT which results in no fraction at all but rather a visual to be represented as a fraction.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

Skip - with Australian English settings it works the same as shown in your post, except the date number is 37268 (12-Jan-2002).

I'm not sure that allowing 12-1/2 to be entered as a number is a good idea, because if entered anywhere other than the first value in an equation its value should be 11.5, not 12.5.

Doug Jenkins
Interactive Design Services
http://newtonexcelbach.wordpress.com/

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

The intention of the table in the PDF is a fraction: 12 1/2, although there's a DASH in there. So to prevent a date conversion, it must be entered in a cell with TEXT number formatting.

So you are stating that the date conversion rules in Australian English settings (and most probably in Great Britain as well), assumes MDY, which I now admit, shoots down my assumption that the conversion is always MDY. So Excel is smarter than I thought, which is better. I want to modify my FAQ to reflect conversion in accordance with regional settings.

My angst is trying to reconcile xwb's experience.

Skip,

glassesJust traded in my OLD subtlety...
for a NUance!tongue

RE: 20% of Scientific Papers On Genes Contain Conversion Errors Caused By Excel, Says Report

In the discussion above Skip said that when you copy and paste from a pdf, the text is pasted as a long string in a single cell, but I found that it was pasted in rows, as in the original table.

That was last week.

This morning I found my system had been updated overnight, and when I did the same process, with the same file, I found the text pasted as a long string in a single cell.

Strangely there seems to be nothing about this anywhere on the Internet, at least I couldn't find it, but with some experimenting I found:

  1. It's not an Excel thing. The same table pasted into a text editor also appeared as a single cell.
  2. I then noticed the pdf file had not opened in Adobe Reader, it was opened in Edge. It seems that Microsoft think they know better than I do about what my default programs should be.
  3. Switching back to Adobe Reader the copy and paste worked as before, pasting as 7 rows.
So the message is, if Windows does an automatic update, check your file associations, particularly for opening pdfs.

Doug Jenkins
Interactive Design Services
http://newtonexcelbach.wordpress.com/

Red Flag This Post

Please let us know here why this post is inappropriate. Reasons such as off-topic, duplicates, flames, illegal, vulgar, or students posting their homework.

Red Flag Submitted

Thank you for helping keep Eng-Tips Forums free from inappropriate posts.
The Eng-Tips staff will check this out and take appropriate action.

Reply To This Thread

Posting in the Eng-Tips forums is a member-only feature.

Click Here to join Eng-Tips and talk with other members!


Resources


Close Box

Join Eng-Tips® Today!

Join your peers on the Internet's largest technical engineering professional community.
It's easy to join and it's free.

Here's Why Members Love Eng-Tips Forums:

Register now while it's still free!

Already a member? Close this window and log in.

Join Us             Close