Named Markup Formats
I've been working on a thing, and I could use some feedback on the implementation. It might take a little explaining (because there's a fair amount of backstory), but I'll try and be as concise as possible.
Backstory: Raw and transformed text
DW stores the text of entries and comments raw, exactly as the user entered it. Then, whenever we need to display that text, we transform it to produce nice legible HTML. Those transformations include:
- Turning `<user>` tags (which aren't real HTML) into a rendered user link, which is really like a span plus an image plus a link.
- Handling `<cut>` tags.
- Several other things, etc., not important right now.
Most of those get applied to everything we display. But there are also some OPTIONAL transformations:
- Turning normal line breaks into HTML `<br>` tags.
- Turning bare URLs into clickable links.
Those get applied by default, but we've always had a "don't autoformat" checkbox (inherited from LJ) that could be used to disable them for an entry or comment. (BTW, under the hood the RTE saves entries as "don't autoformat.")
Then, later, DW added some other optional transforms, which had their own special enabling conditions:
- Turning Markdown into HTML. (For entries that start with a special `!markdown` glyph, or comments submitted by email.)
- Turning `@mentions` into user tags. (Currently applies to everything except "don't autoformat," but gets suppressed within certain HTML elements or their Markdown equivalents.)
ALL of these transforms get handled by something called the "html cleaner," at LJ::CleanHTML. At this point "cleaner" is kind of a misnomer; in actual fact, it's the central place where we handle all transformations of raw user-entered text into a fragment of display HTML.
The problems
In my understanding, the current state of affairs has two main problems:
- The interface for choosing text transforms is incoherent. That happened gradually; we've added new transformations over time, and changed the interactions between them, and now it's weird:
  - Half of the interfaces for entering entry/comment text don't even include the "don't autoformat" checkbox anymore.
  - The way you enable Markdown has always been a mystery. For example, I want to use Markdown in comments (because typing html angle brackets on a telephone is bullshit), but currently it's impossible except when responding via email.
- Introducing new text transforms is dangerous and chaotic. In mid-2019, we enabled `@mentions` for HTML-formatted content (previously they only worked in Markdown content), and about 40% of hell broke loose:
  - Current content suffered because we didn't have a good way to beta-test `@mentions`, so we didn't have a chance to learn about bugs and edge cases from our users (who are more ingenious at doing weird textual shit than we are) before enabling them globally.
  - Old posts weren't written to expect `@mentions`, so we ended up totally vandalizing any historical post that ever discussed CSS code (`@media (min-width: etc...)`), Perl or Ruby or Objective-C code, or a wide variety of other things that involve @ signs.
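To make the second problem concrete, here is a minimal Python sketch of a naive mention transform. The regex and the `<user name="...">` output are illustrative assumptions, not the actual LJ::CleanHTML logic; the point is only that a pattern loose enough to catch mentions will also catch CSS at-rules in old posts.

```python
import re

# Hypothetical, deliberately naive pattern: "@" followed by a username-like word.
# This is NOT the real cleaner's rule; it only illustrates the failure mode.
MENTION = re.compile(r"(?<![\w@])@(\w+)")

def naive_mentions(text: str) -> str:
    """Rewrite every @word as a DW-style user tag."""
    return MENTION.sub(r'<user name="\1">', text)

chat_post = "Thanks to @alice for the tip."
css_post = "Use @media (min-width: 600px) for wide screens."

print(naive_mentions(chat_post))  # the intended behavior
print(naive_mentions(css_post))   # the CSS at-rule gets rewritten too
```

An old post that was saved before the transform existed has no way to opt out of it, which is the gap a per-post format is meant to close.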
Questions: Does anyone disagree with those two problems or my characterization of them? Does anyone see any closely related problems that I'm not recognizing here?
The solution, maybe
I've got a pull request up for discussion right now that tries to solve these immediate problems, as well as several other problems I can see coming down the road at us.
In brief (lol, sorry, I'm trying!!), here's what I'm proposing we do:
- Instead of a set of disconnected text transform options that can be independently twiddled, we codify a set of named markup formats. Each entry or comment will specify exactly one format.
  - To start with, this will include all of the valid combinations of the current formatting options, which ends up being fewer than you might think:
    - `markdown0` - Markdown via the Text::Markdown module, plus our own `@mentions`.
    - `html_casual1` - Classic HTML-but-it-respects-your-linebreaks, plus `@mentions`.
    - `html_casual0` - Like casual 1, except without `@mentions`. Old content was written to expect this, but we don't respect that anymore. Imported and syndicated content still uses this, though.
    - `html_raw0` - "Don't autoformat." No `@mentions`.
    - `html_extra_raw` - Syndicated content that we know will never use special DW tags like `<user>` or `<cut>`.
  - (For existing posts that don't specify a format, we use the existing independent options to guess a format. We also take the post date into account, so we can restore casual 0 on posts that were originally written in it.)
- When formatting text for display, the specified format determines all of the transforms we use.
  - No more independent twiddlies. The format captures the entirety of the user's intent for how their markup should be handled.
- Formats are versioned.
  - If we want to introduce new formatting behavior (like `@mentions`) or make changes that might have non-linear consequences (like switching from Text::Markdown to a different Markdown processor), we have to implement it as a new format (e.g. `markdown1`).
  - Old versions stick around in the HTML cleaner FOR ALL ETERNITY, so that we can continue to display content as it was originally intended.
  - This doesn't preclude making small changes for safety or consistency -- for example, putting a user tag inside a link always used to generate illegal link-within-a-link markup that we'd just ignore and let your browser sort out, but I recently made it so we strip the inner link. But anything that changes what a user would expect to get from their markup should cut a new format.
- When posting or editing a comment or entry in the web UI, the form includes a drop-down for choosing which format you want to use.
  - (With a descriptive name, not weird IDs like `html_casual1`.)
  - We only offer users the "active" formats -- although we keep around legacy formats for displaying old content, we don't clutter the UI with them. Today, the active formats for non-syndicated posts would be `markdown0`, `html_casual1`, and `html_raw0`.
    - The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change.
  - We remember the last format you chose when making a new post, and use that as your default.
    - (new detail) If you deliberately change the format to something other than your current default, we give you the option to set that as your new default after saving your post/edit.
  - If we're introducing a new format, we might set it as active but mark it as beta, and keep the format it replaces for a while. When posting with a beta format, you're accepting the possibility that the markup behavior might retroactively change for that post, until the beta ends.
- The selected format in a comment/entry form might change other behavior of the form.
  - Specifically, I'm thinking of markup helper buttons for the most common text styles, which is another back-burner project I'm working on -- the bold/italic/link buttons would add HTML tags for the HTML formats, and Markdown styles for Markdown.
  - Also thinking of the future replacement RTE here -- RTE would get its own named format (even though practically speaking it will probably emit something that looks like `html_raw0`), and switching your format to "Rich" would initialize the editor and its controls. Switching your format to something non-rich would serialize the editor's buffer as HTML and take you back to a plain textarea; we wouldn't need a separate control for switching to the RTE beyond the normal format drop-down that everything else uses.
  - ...I guess we could also add syntax highlighting or something for HTML and Markdown, but I don't think anyone's planning on it and I doubt anyone's desperate for it.
- (new detail) When posting via email, we'll add a new post header for formatting. You can specify either a format ID (like `html_casual0`), or a shortcut ID to choose the most recent versions of casual HTML or Markdown. If you omit the header, it posts with the current global default for that type of content, ignoring your user default.
  - Why: Email is sort of halfway between a UI and an API. People who are automating email posts want to choose a stable and predictable format (so use IDs), but people who are writing in their real mail client probably want their posts to act like the version of the web form they're accustomed to. We'll ignore user defaults because your email client doesn't give you any feedback about what's going to happen, so we don't want it to depend on hidden settings state.
- (new detail) When posting via the new API, you can specify a format ID or fall back to a global default. User default is ignored.
- (new detail) Email and API posts aren't limited to the currently active formats; they can specify any format they want, including obsolete ones. This should help us avoid breaking automation that people build.
- (new detail) When posting via the old XMLRPC API (so, old LJ clients like Semagic), you cannot select a named format. You'll be limited to today's formatting options (the "don't autoformat" checkbox if supported by the client, and the `!markdown` secret glyph), which will forever behave exactly like they do in mid 2020 (resulting in either `markdown0`, `html_casual1`, or `html_raw0`).
  - Why: Old client programs want things to keep working the way they've always worked... so that's what they'll get. And we don't want to add more secret glyphs, because they make things harder to maintain and understand.
- (new detail) If a browser extension happened to mangle the web forms to re-enable obsolete formats, we wouldn't care. From our perspective, that's the same thing as an API client posting with a weird format, which is fine. (Removing obsolete formats from the menu is just about keeping things simple for most users.)
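As a rough illustration of the proposal above, the following Python sketch models a format registry where the format ID alone decides which transforms run, and where the edit menu shows active formats plus a post's existing legacy format. The transform names (`dw_tags`, `linebreaks`, `autolink`, and so on) and the exact sets per format are my reading of the descriptions above, not the pull request's actual data structures; the real implementation lives in Perl inside the HTML cleaner.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MarkupFormat:
    format_id: str
    transforms: frozenset   # names of transforms applied at display time (assumed names)
    active: bool            # offered in the web UI for new content

def _fmt(format_id, transforms, active):
    return MarkupFormat(format_id, frozenset(transforms), active)

# One entry per named format; old versions are never deleted.
FORMATS = {f.format_id: f for f in (
    _fmt("markdown0", {"dw_tags", "markdown", "mentions"}, True),
    _fmt("html_casual1", {"dw_tags", "linebreaks", "autolink", "mentions"}, True),
    _fmt("html_casual0", {"dw_tags", "linebreaks", "autolink"}, False),
    _fmt("html_raw0", {"dw_tags"}, True),
    _fmt("html_extra_raw", set(), False),
)}

def transforms_for(format_id):
    """The format alone decides which transforms run; no independent options."""
    return FORMATS[format_id].transforms

def menu_options(existing_format=None):
    """Active formats, plus the post's current format when editing legacy content."""
    options = [f.format_id for f in FORMATS.values() if f.active]
    if existing_format in FORMATS and existing_format not in options:
        options.append(existing_format)
    return options

print(menu_options())                # new post
print(menu_options("html_casual0"))  # editing an old casual 0 post
```

Under this shape, cutting a `markdown1` would mean adding a new row and flipping `markdown0` to inactive, while every existing `markdown0` post keeps rendering through its original row.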
I THINK this will free us up to be much more nimble about modernizing the way we process text, and will increase user choice while ALSO making the site less confusing to use, which is a rare combo.
Questions: Does the approach make sense? Can you think of anything it would sabotage or prevent us from doing in the future? Can you see anything excessively complicated in the code itself? (By which I mean, can we make this simpler while still solving those two problems.)
Non-linear consequences
Here are the things I can think of that might be affected in weird ways by this change:
- External clients and integrations. We'll be expecting new entries and comments to arrive with a format specified (using the new-ish `editor` property), so anything that posts content without that will be locked into the "guessed" format that fits their other options. (That's raw 0 if they set "don't autoformat," and casual 1 if they didn't.) Over time, that'll drift more and more out of date; when we add some new `@mention`-like thing that forces us to cut a casual 2, we're going to keep guessing casual 1 for posts with no metadata, because all of OUR stuff properly sets the format type.
  - TBH, I think this is 100% fine. Feel free to make a case for otherwise, tho.
- All the documentation/FAQ pages about formatting stuff will be immediately out of date.
  - On it, don't worry.
- Once we finally make Markdown discoverable, people might start requesting we add other oddball lightweight markup formats, like textile or RST.
  - I think the answer to that is basically "no?" It should be "no."
- Another thing that might happen is that we start getting pressure to add some of the niceties of more modern Markdown implementations, like fenced code blocks and ascii-art tables.
  - Well, at least we'd be able to do that safely, by cutting a `markdown1` and leaving existing content on `markdown0`. It's just a question of what we actually want our Markdown to act like, and THAT is going to be an exciting conversation.
- Doubling down on the importance and centrality of the HTML cleaner, which is a highly complex thicket of code that not a lot of people feel comfortable working with.
  - I'm not aware of any plans to replace the cleaner, but this would definitely require them to be rethought. That said, we're always going to need SOME central thing that governs text transformations, and we could move the implementation to a different spot in the code if need be.
Please suggest more of these if you think of them.
no subject
* User goes to start an entry
* User picks a format at semi-random
* User posts
* User discovers that it was not the intended format
* User edits to pick intended format
[some time passes]
* User makes new entry
* Format of new entry retains the original, incorrect, randomly selected format
* User believes that last entry was posted with correct format, is annoyed that this one isn't
* User edits most recent entry
* Cycle repeats
no subject
How about: when editing, the format switcher has an “also use this format next time” checkbox by it?
...Not yet sure that wholly solves it, also gotta find space for that checkbox and its label somewhere.
no subject
I would like to mention tangentially, from a relatively technical end-user perspective, that the inconsistency in places is horribly grating even when I do know how to and often prefer to write raw HTML. Specifically, though I'm not sure where this fits into the above, the amount of `<!--`, newline, `-->` in our profile text because there's no way to disable having a `<br>` jammed into the middle of what would otherwise be a void between list items or something.
Also, I've gotten into the irritating habit of never writing newlines in comments (I'm doing it right now, in fact), because I have an Aversion to having `</p><p>` be rendered as `<br><br>`, and the “no, don't give me this comment form, give me the one which allows me to do that but also doesn't show context in a remotely ergonomic way etc.” alternative got too grating to use over time. I take it the Markdown conversion does better at this?
Reduce complexity first?
E.g., is it possible to delete "Disable Auto-Formatting" option?
Re: Reduce complexity first?
- If you mean taking away the ability to do raw HTML without getting your linebreaks mangled, that can't go away. It's an important escape hatch that lets people post things that were generated by other programs. (For example, using Scrivener or Word to export a chapter of fanfic as HTML.)
no subject
As the person primarily responsible for much of the mess we're in today, the driving goal behind Markdown and @-mentions is to make Dreamwidth easier to use and more accessible (as a technology). Having to learn to write HTML is not something one should _have_ to do in order to express oneself, and so I strongly feel that we should be moving in directions to make it easier.
That said, we are of course built on the idea of customization and being able to really go deep in what kinds of things you can do to your content. That has many positive things, but ultimately, LJ::CleanHTML is the kind of thing you end up with when you want to make real ultimate cosmic power but also combine it with a reading page that has to combine _my_ journal customization insanity with _your_ entry customization insanity. God help us all.
So here we are. You're untangling my hacks, and I appreciate that.
I think basically I agree with your proposal and what you're thinking here, and I concur that we should draw the line at Markdown (and possibly some additions to it, since the Perl Markdown renderer is pretty minimal).
Ultimately tho, it doesn't look like this will really _change_ user behavior/experience? They will see a new dropdown/UI change that lets them explicitly select their formatting, which will save itself (but I saw the comments above, which is a good point to try to make sure we don't confuse/annoy users by changing their defaults unexpectedly).
One thing not mentioned, but I think because it's obvious but I'll say it anyway so it's documented, crossposting doesn't change here right? Since we just render that out as HTML and pass it raw to the target site and ask them not to mangle it. That should work just fine.
(As an aside, something I've been pondering as I've looked at LJ::CleanHTML in the past is if we could drastically simplify it with some modern technologies. It does a lot of work to try to block things like scripts, but possibly we can just use CSP or such to block those by policy and then simplify? I suppose this doesn't fix a lot of the "HTML fixing" we do though, so ultimately it probably doesn't actually simplify the cleaner by any material amount. Oh well.)
no subject
Yes, exactly!! Now that we're pre-rendering everything and force-setting opt_preformatted at the destination (instead of trying to work within a foreign version of casual html), we're home-free no matter WHAT we do. As long as we're:
- Able to render non-broken HTML for our own site
- Translating things that DO directly correspond to magic LJ markup (right now that's just @mentions, but if we ever made a shorthand syntax for e.g. cut tags we'd want to handle that too)
...then crossposting stays fine.
That's my hope, yeah — the user side of this should just feel like we replaced the "don't autoformat" checkbox with an easier-to-understand dropdown. The only complex bits from the user perspective are:
- How we set user defaults (cf. above).
- The combination of "current format affects editor controls" and "RTE is just another format". This makes perfect sense to me, but if we, e.g., ended up with an RTE that had its own integrated "view source" mode, we could end up in a nested hierarchy, where I chose rich formatting but I'm viewing/editing HTML code as part of the implementation of "rich." Which I think I don't actually have a problem with, but it might be worth thinking through those contradictions.
- How we handle obsolescence of formats. (The "but I liked the old way" effect -- would we want to offer a backdoor way to keep your default locked to a hidden format? So far the consensus of staff and volunteers sounds like "no." I'm inclined to agree.)
Word. IMO the security part of CleanHTML contributes a lot to its gnarliness, so even if the text transformation part is always going to have to happen somewhere, it'd still be super cool to make the security part more modern and concise. But when it comes to XSS security, gotta admit that I'm Baby, so I don't have a useful contribution there yet.
no subject
I agree, roadrunnertwice has done a really nice job thinking this through.
It might just be my age and remembering a time before CSP, but I'd pretty strongly recommend against relying on CSP alone. CSP is great as a backup for when something gets through, but there are a number of cases where even a perfectly-specified CSP is not going to work very well:
More generally, I'd argue that relying on CSP would move security issues from DW's control to third-party control (browser vendors, cross-post targets, e-mail clients, feed readers), which is a lot of faith to put in other people getting security right. Yes, use CSP! But in conjunction with existing security measures, not as a replacement.
It might be possible to simplify things by layering them—one layer to do input transformations, and another for security transformations, say—but that would require very careful design if input transformations need to produce things that would be rejected by the security layer (so no reserved HTML classes or ids, at least without shenanigans like using a random nonce or signed token that gets replaced with the reserved value).
no subject
I think editing a post to a different format should also set that format as your default, if we’re going with using a format sets it as default. I think if anything there should be a ticky (possibly on an interstitial page) asking to check if the user doesn’t want it set as the new default.
Otherwise just don’t set it automatically at all anywhere, and always include a ticky for making it your default. Those are the two options I see. I don’t see a compelling reason for the editing page to behave differently than post or comment.
no subject
It's mostly in case of editing entries that are in an obsolete format, or an active format you were using before you realized you preferred a different one -- the old clothes might not fit the new you.
Anyway, I'm leaning toward being more explicit; most appealing thing I've thought of so far is: if you save a post in something that's NOT your default format, you end up on a "success" page that includes a "set this format as my default" button.
Edit: Oh, just realized I didn't answer your first question. 😑 For email and API posts, the user's default format shouldn't apply; we'll use the specified format if there is one, and fall back to a global default otherwise.
Reason being, email and API don't give you any feedback about what your settings are before you hit send. So it'd be like, "hope you remember exactly what you set previously," which, bad! So for automated or semi-automated stuff like that, deterministic consistency is way more important than convenience.
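A minimal Python sketch of that rule, to show how small the decision is. The function name, the channel strings, and the choice of `html_casual1` as the single global default are assumptions for illustration (the post describes a global default per type of content).

```python
# Format IDs from the post; the global default below is an assumption for this sketch.
KNOWN_FORMATS = {"markdown0", "html_casual1", "html_casual0", "html_raw0"}
GLOBAL_DEFAULT = "html_casual1"

def resolve_format(channel, requested=None, user_default=None):
    """Pick the format for a new post arriving via 'web', 'email', or 'api'."""
    if requested is not None:
        if requested not in KNOWN_FORMATS:
            raise ValueError(f"unknown format: {requested}")
        return requested
    # Only the web form consults the user's saved default; email and API
    # callers get the same answer regardless of hidden account settings.
    if channel == "web" and user_default in KNOWN_FORMATS:
        return user_default
    return GLOBAL_DEFAULT

print(resolve_format("web", user_default="markdown0"))
print(resolve_format("email", user_default="markdown0"))
print(resolve_format("api", requested="html_casual0"))
```

The same email sent twice resolves to the same format no matter what the sender changed in their account settings in between, which is the determinism argued for above.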
no subject
tags where you didn't want them, even repeatedly. (I don't have specific reproducible steps but I'm fairly certain it wasn't just a user error.)
Otherwise, I think this is a very neat solution to the problem and can't see any obvious issues with it.
(Also, yay for markdown in comment forms! I'm not very taken with the current DW markdown flavor, but it still beats writing HTML markup by hand.)
no subject
For rich text, it gets squirrelly, like you've noticed and like
- Switching to rich text if you have some current text entered will try to translate it to fit the RTE's internal model. We'll want to be able to deserialize both HTML and markdown, and we can probably do something to distinguish between raw and casual HTML to preserve most of the user's intent.
- But that's a lossy transformation, and there's no way around that.
- So it should warn you about that, and give you a chance to cancel out.
- It should also warn you when switching OFF the RTE -- the transformation to plain HTML would be non-lossy, but if you're about to try and tweak something and then switch back to rich mode, you'll hit a lossy transformation step again, so you should be aware of that beforehand.
html_casual0
`@mentions`, and have been using @ to mark Twitter usernames in attributions for quotations. So the ability to select html_casual0, especially in post-by-email, would be useful to me. (I've currently got the usual way Twitter IDs show up in my entries special-cased in the script, but sometimes they're not in quite that context and I have to hand-edit my entry after it's posted.)
Basically, I'm trying to post as though it were syndicated, despite not using a syndication API. So it'd be nice if I had access to a syndication-friendly input format.
(Also, is there a way to specify "do not autoformat" in post-by-email? My script currently tries to reformat from html_casual0 into something suitable for allow-autoformat.)
The other reason I'd like to keep an html_casual0 option is that it's easier for me to edit in HTML with line breaks in my source, but specify presentation linebreaks with <p> and <br />, which is what I used to get with "clive -p". So my script'd be more reliable if I can strip out the bit where it attempts to remove existing linebreaks and replace <p>..</p> and <br /> with new linebreaks.
I have no idea how many other people are still using command-line clients to post entries though. (BTW, if anybody's hacked Clive to work with DW again like it used to, and are willing to share your diffs, I'd be grateful. I was using Clive 0.4.9 when the DW API changed just enough to break it, but looking at the Changelog for Clive 0.4.10 it doesn't look like any DW support was added to it.)
Re: html_casual0
Couple of things:
- Today, email posts can't set formatting options. (Comments are locked to Markdown, and posts are locked to default behavior i.e. "casual 1".) But tbh, why not? If we go forward with this, I'd be in favor of a `post-editor` header that lets you choose your own format.
- Same thing for posts via the new API; we'd let you set the format of your choice.
- About obsolete or weird formats: I actually don't care whether API clients (or email) use these; in fact, there's very good arguments for letting semi-automated stuff like that use every valid format. So my current implementation does nothing to block it, and you'd be able to use html_casual0 just fine. (The thing I DO care about is keeping the post/reply web UI clean.)
Although: it kind of sounds like you'd rather post in html_raw0, since you're entering your own <p> and <br> tags anyway.
no subject
Also I use Semagic to post to DW, so I wouldn't be able to select a format for each post. Instead of having the site guess what format to use, would it be possible to create a "!markdown" like command to put at the head of a post to indicate the format? Or maybe have an option in settings where a user can choose a default format for new posts.
no subject
Old clients like Semagic are gonna be out of luck on this one, I think -- they'll be locked to mid-2020 behavior (that is: default html_casual1, optional html_raw0 if they support a "don't autoformat" checkbox, markdown0 if you start the post with !markdown) for as long as they keep working. Reasons being:
- I'm trying to get AWAY from weird secret glyphs like !markdown, so I don't want to add more of them.
- Adding new global settings is bad.
- Being able to set a "hard default" like that for API posts would introduce really confusing and unexpected behaviors, I think — instead of "no format specified" having a single reliable effect, it could have any number of effects depending on remote settings. In fact, that could cause existing clients to break even faster than they're already breaking. Better to have API calls be predictable.
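The frozen mid-2020 behavior for old clients can be sketched in a few lines of Python. The function name is hypothetical, and the precedence between the `!markdown` glyph and the "don't autoformat" flag when both are present is an assumption; the comment above doesn't say which one wins.

```python
def legacy_client_format(body: str, dont_autoformat: bool = False) -> str:
    """Map an old XMLRPC client's options onto one of the three frozen formats."""
    # Assumption: the glyph is checked first; the source does not specify the order.
    if body.startswith("!markdown"):
        return "markdown0"
    if dont_autoformat:
        return "html_raw0"
    return "html_casual1"

print(legacy_client_format("Hello from Semagic"))
print(legacy_client_format("<p>exported chapter</p>", dont_autoformat=True))
print(legacy_client_format("!markdown\n# Heading"))
```

Because the function takes no account settings and no date, "no format specified" keeps a single reliable meaning for as long as those clients keep working.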
no subject
- For just commenting, you can log in with your dreamwidth openID without having to sign up.
- If you want to post entries too, I think the invite code “5GNA9R3SWM656AAAAAAN” is still active.
- Anyone else reading this wants an invite code, just say so.
no subject
----
1) What happens if you find a security issue in an older formatter and need to overhaul or replace the implementation? (Perhaps the implementation is so broken it *can't* be fixed, and needs replacement.)
Can DW reasonably do a search to sample for posts and comments where the change makes a meaningful difference in output? And what happens if there's a discrepancy?
Technically, your proposal wouldn't change this situation all that much, since DW *already* has this problem. But it might offer more solutions. For example, you could "clone" markdown1 as markdown1_pre_20210101, set old markdown1 posts to use it, and fix markdown1 in place. If new posts are *forbidden* from using markdown1_pre_20210101, and editing an old post bearing that formatter switches it to markdown1 so that people can't use them to host exploits, then you could have your cake and eat it too...
This would be more complexity, and I'm not suggesting it's a great idea, but I think it's worth considering how that might play out. Old libraries sometimes have to be entirely replaced, and those corner cases can be *hard* to replicate.
(Another possibility is rendering all the old content to HTML in order to freeze it, and storing that alongside the source markup.)
----
2) You write about the scenario where someone edits an old post:
« The exception is if you're editing older content that uses a legacy format. In that case, the menu includes the active formats PLUS the post's existing format, so you don't have to update all formatting within the post just to make a minor change. »
What happens if I edit an old post and change the format selector to a newer format, and realize I've made a mistake and actually need it to go back to the deprecated format? Will the selector still show the *previous/original* format?
no subject
As written (in the newly updated version of this PR), no, you'd be outta luck. (Unless you used a userscript/browser extension to send an old format type, which we wouldn't do anything to prevent.) My hope is that:
1. People mostly won't do this.
2. The consequences of oopsing it on one post won't be very large. For example, if you accidentally lost html_casual0 on a thing, you'd just have to make sure that post wasn't using @ signs in ways that look like user mentions, and escape them if so. The real hell of format updating happens when you need to update EVERYTHING -- updating ONE thing should be fairly mild.
If that DOES turn out to be a major annoyance, I bet we can think of some ways to deal with it. First one that occurs to me: store the "official" date when each format went obsolete, and when editing, add in every format that was still active at the time the post was originally made. Anyway, I think I feel fairly comfortable leaving that as a problem to solve if-and-when.
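Here is a hedged Python sketch of that fallback idea: record when each format was retired, and offer every format that was still active when the post was written. The table name, the function name, and especially the retirement date are placeholders, not real cutover data.

```python
from datetime import date

# Hypothetical retirement dates; None means the format is still active.
RETIRED_ON = {
    "markdown0": None,
    "html_casual1": None,
    "html_raw0": None,
    "html_casual0": date(2019, 7, 1),  # placeholder, not the real cutover date
}

def edit_menu(posted_on: date) -> list:
    """Formats to offer when editing a post originally made on `posted_on`."""
    return [
        fmt for fmt, retired in RETIRED_ON.items()
        if retired is None or posted_on < retired
    ]

print(edit_menu(date(2015, 3, 14)))  # includes html_casual0
print(edit_menu(date(2020, 6, 1)))   # active formats only
```

With this shape, switching an old post to a newer format by mistake is recoverable, because the menu depends on the post's original date and not on its current format.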
Are you trying to curdle all the milk in my fridge, or what. 😅
The idea of (best-effort secure clone) + (locked version of insecure original) is really interesting, thank you for thinking through that! (Also glad I'm not the only one who finds this an interesting family of problems to think through. 😄)
But I think the most likely scenario would be that, as a matter of ops policy, we'd:
- Replace vulnerable format with a best-effort secure clone, which probably has some edge-case inaccuracies.
- Fix those edge-case bugs as they're found, and count our blessings that it was a security bug whose long-term fallout was just "some old posts might have rare display glitches for a couple months."
Doing a broad sample from the database to check the quality of a clone seems pretty trivial, probably. "Baking" old posts into HTML... that actually sounds pretty gnarly. Actually, I was just reading about StackExchange's big Markdown processor changeover the other day -- did you know they dealt with that by running a 💀world-ending database migration that did sight-unseen automatic edits on every piece of user content ever submitted?💀 I think DW probably doesn't have the resources to attempt something like that, and a big HTML bake-fest is sort of a milder version of that same idea.
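For the sampling check, something like the following Python sketch would do: render a random sample of stored text through both the old formatter and its replacement, and collect the cases where the output differs. The function name and the toy renderers are made up for illustration; a real check would call the two cleaner code paths and pull the sample from the database.

```python
import random

def compare_renderers(samples, old_render, new_render, sample_size=100, seed=0):
    """Return (raw, old_html, new_html) for every sampled text that renders differently."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    picked = list(samples)
    if len(picked) > sample_size:
        picked = rng.sample(picked, sample_size)
    mismatches = []
    for raw in picked:
        before, after = old_render(raw), new_render(raw)
        if before != after:
            mismatches.append((raw, before, after))
    return mismatches

# Toy stand-ins for "vulnerable original" and "best-effort secure clone".
def old_toy(text):
    return text.replace("\n", "<br>")

def new_toy(text):
    return text.replace("\n", "<br />")

diffs = compare_renderers(["one line", "two\nlines"], old_toy, new_toy)
print(len(diffs), "of 2 samples render differently")
```

The mismatch list is the to-do list of edge-case bugs to fix in the clone, which fits the "fix them as they're found" policy described above.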
Anyway, good thinks, thanks for raising em.